Abstract

The  process  of  effectively  coordinating  and  controlling  resources  during  a  mil¬ 
itary  engagement  is  known  as  battle  management/command,  control,  and  communi¬ 
cations  (BM/C3).  One  key  task  of  BM/C3  is  allocating  weapons  to  destroy  targets. 
The  focus  of  this  research  is  on  developing  parallel  methods  to  achieve  fast  and  cost 
effective  assignment  of  weapons  to  targets.  Using  the  sequential  Hungarian  method 
for  solving  the  assignment  problem  as  a  basis,  this  report  presents  the  development 
and  relative  performance  comparison  of  four  parallel  assignment  algorithms  imple¬ 
mented  on  the  Intel  iPSC  hypercube  computer. 

The  first  approach  partitions  the  problem  space  into  smaller,  independent  sub¬ 
problems  and  assigns  each  to  a  processing  node  in  the  hypercube.  The  second  and 
third  approaches  also  partition  the  problem  space,  but  they  assign  each  partition  to  a 
group  of  processing  nodes.  Each  group  is  controlled  by  a  separate  node  which  further 
subdivides  the  partition  among  members  of  the  group.  In  the  second  approach,  the 
control  node  acts  as  an  arbitrator  to  eliminate  the  redundant  assignment  of  weapons 
to  targets  by  idling  redundantly  allocated  weapons.  The  third  approach  eliminates 
redundant  weapon  allocations  by  selecting  the  least  costly  redundant  allocations  and 
directing  additional  processing  to  reallocate  the  more  costly  weapons.  The  fourth 
approach  is  a  parallel  implementation  of  the  Hungarian  algorithm,  where  certain 
subtasks  are  performed  in  parallel.  This  approach  produces  an  optimal  assignment 
instead of the suboptimal assignment generally obtained using any of the three
heuristic  approaches. 

The  relative  performance  of  the  four  approaches  is  compared  by  varying  the 
number  of  weapons  and  targets,  the  number  of  processors  used,  and  the  size  of  the 
problem  partitions.  The  first  and  second  approaches  produce  assignment  solutions 
significantly faster than the baseline sequential methods. The third and fourth approaches
produce solutions more slowly than the first two, but are still faster than sequential methods of assignment.

IMPLEMENTATION AND PERFORMANCE ANALYSIS
OF  PARALLEL  ASSIGNMENT  ALGORITHMS 
ON  A  HYPERCUBE  COMPUTER 




1.  Introduction 

Parallel  processing  is  a  method  of  computation  that  exploits  the  concurrent 
events  that  occur  in  the  solution  of  many  different  problems  [HwB84].  Parallel  com¬ 
puters  employing  multiple  processors  exploit  these  concurrent  events  by  assigning 
each  event  to  a  different  processor  for  simultaneous  processing.  The  results  of  these 
parallel computations are combined to form a solution to the overall problem [Hil87].
Parallel processing is presently the subject of intense research and development. The
main reason for the increased interest in parallel processing is the wider availability
of parallel multiprocessor computers [Fre86]. Improved technology in the areas of
VLSI (Very Large Scale Integration) circuits, high-speed communications, and hardware
packaging has combined to make these parallel computers more reliable and
much less expensive [Sei85, Den86, Fre86].

Recent  software  implementations  have  shown  that  significant  reductions  in 
processing times are possible using parallel processing [Qui87]. Many of these implementations
involve large-scale problems in areas such as fluid dynamics [EbB86],
high energy physics [Fox84], partial differential equation solutions [SaN85], statistical
mechanics [FoO84], image processing [MuA87], and several other areas that were
previously not feasible because of the excessive processing times required when using
single-processor computers. Faster solutions to these large-scale problems appeal to
many  researchers  in  government  and  industry  because  they  allow  more  accurate  and 
extensive modeling of complex processes during the development and design phases
of new systems. One particular government organization with a keen interest in the
increased  processing  speeds  provided  by  parallel  processing  is  the  Strategic  Defense 
Initiative  Organization  (SDIO)  [AdW85,  Lin85,  BoR85]. 


1.1  SDI  And  Parallel  Computing 

The  Strategic  Defense  Initiative  (SDI)  was  launched  by  President  Reagan  in 
a  televised  speech  on  March  23,  1983.  In  this  speech,  he  challenged  scientists  and 
engineers  to  work  to  render  nuclear  weapons  “impotent  and  obsolete.”  He  proposed 
a  research  and  development  program  to  determine  if  a  “smart”  system  of  nonnuclear 
defense  could  effectively  knock  out  incoming  offensive  ballistic  missiles  before  they 
detonate over our country [Rea83]. If all dollar amounts are adjusted to today's
value, the SDI is potentially the most expensive research and development program
ever attempted and far more expensive than the Manhattan Project, which produced
the atomic bomb [AdF85].

The  overall  system  architecture  of  the  SDI  system  is  envisioned  as  one  of  sev¬ 
eral  defensive  layers  corresponding  to  the  different  phases  that  occur  in  the  trajectory 
of  a  ballistic  missile.  Those  phases  are  the  boost  phase,  the  midcourse  phase,  and 
the  reentry  or  terminal  phase  [DrF85].  Within  each  defensive  layer,  computers  will 
use  information  gathered  from  sensors  to  detect,  classify,  and  track  potential  tar¬ 
gets.  Using  this  information  and  predefined  engagement  strategies,  weapons  will  be 
assigned to destroy certain high-threat targets. After firing on assigned targets, the
effectiveness of the weapons would be evaluated and used to make future weapon
engagement decisions. The combination of all of these processes is known as battle
management/command, control, and communications, or BM/C3 [SeD85, Lin85].

Many prominent scientists argue that the realization of a reliable defense system
of the magnitude that will be required by the SDI is not possible [Lin85, Par85,
Noz86].  Although  development  in  the  areas  of  laser  beam,  particle  beam,  and  kinetic 
energy  weapons  is  still  in  the  beginning  stages,  preliminary  results  are  promising. 
A major issue with these weapons is providing them sufficient energy for effective
operation when they are deployed in space [AdF85]. Development of the BM/C3
system  is  the  area  of  most  concern.  During  a  full-scale  missile  attack,  hundreds  of 
thousands  of  interrelated  decisions  will  need  to  be  made  about  how  to  most  effec¬ 
tively  utilize  available  defensive  weapons.  These  complex  decisions  must  be  made 
within  milliseconds  of  each  other  in  order  to  deploy  defensive  weapons  in  a  timely 
manner.  Because  time  and  complexity  constraints  make  them  humanly  impossible, 
these  decisions  must  be  made  with  the  assistance  of  fast  and  reliable  computers  using 
intelligent software. For example, if the enemy launched 1400 missiles in an attack,
then more than 10 enemy missile kills per second would be needed to destroy most of
the missiles shortly after they were launched [AdW85]. Development of the millions
of lines of error-free software code and the computer systems to flawlessly execute
the software to accomplish these BM/C3 tasks is viewed as impossible by Parnas
[Par85]. The magnitude and complexity of BM/C3 software prompted Lieutenant
General James A. Abrahamson, director of the SDIO, to state in an interview that
the “incredible software problem” of the battle management system is “the challenge
of all time” [Chr85]. The SDIO is now actively conducting research in many areas
on  how  to  meet  the  challenge  of  developing  a  viable  battle  management  system. 

An  area  of  particular  interest  is  the  development  of  fast  and  reliable  BM/C3 
computer  systems  for  controlling  weapons,  sensors,  and  other  equipment  that  will 
comprise  the  SDI  system  [AdW85].  One  concern  is  the  computation  time  that  a 
single-processor  computer  might  require  to  control  and  coordinate  all  of  the  activ¬ 
ities  within  a  defensive  layer.  The  basic  computational  speed  of  a  single  proces¬ 
sor  is  limited  by  internal  signal  propagation  delays  and  is  not  expected  to  exceed 
1  GFLOPS  (Giga-Floating  Point  Operations  Per  Second)  with  current  circuit  tech¬ 
nology  [Den86].  Estimates  of  the  computational  speed  required  for  some  BM/C3 
tasks  are  more  than  10  GFLOPS  [AdW85].  One  way  researchers  believe  faster 
computations will be possible is to develop system architectures that utilize parallel
processors [SeD85]. The defensive layer could then be divided into relatively independent
regions. Each region would be assigned to a separate set of processors within
the  multiprocessor  computer  to  coordinate  and  evaluate  activities  within  that  region. 
When  combined  with  efficient  software  developed  especially  for  parallel-processors, 
the  overall  computations  could  be  completed  in  a  time  much  shorter  than  that  achiev¬ 
able  with  a  single-processor  computer  [San87].  The  ideal  speed  increase  or  speedup 
of  a  parallel-processor  with  n  processors  over  a  single-processor  computer  is  n.  In 
some  cases,  greater  than  n  speedup  can  be  achieved  by  utilizing  certain  parallel  al¬ 
gorithms.  The  possibility  of  ideal  or  better  speedups  with  parallel  computers  creates 
the  potential  for  meeting  or  exceeding  the  predicted  computational  requirements  of 
the  proposed  BM/C3  system. 
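
The speedup argument above can be made concrete with a small sketch. The function names and the timing values below are hypothetical and chosen only for illustration; the 1 GFLOPS per-processor limit and the 10 GFLOPS requirement are the figures quoted in the text. Efficiency (speedup divided by the number of processors) is a commonly used companion measure and is an addition here, not a quantity defined in this chapter.

```python
import math


def speedup(t_sequential, t_parallel):
    """Ratio of single-processor run time to parallel run time."""
    return t_sequential / t_parallel


def efficiency(t_sequential, t_parallel, n_processors):
    """Speedup per processor; 1.0 corresponds to the ideal speedup of n."""
    return speedup(t_sequential, t_parallel) / n_processors


def min_processors(required_gflops, per_processor_gflops, assumed_efficiency=1.0):
    """Smallest processor count whose aggregate rate meets the requirement."""
    return math.ceil(required_gflops / (per_processor_gflops * assumed_efficiency))


# Hypothetical timings: 120 s on one processor, 20 s on 8 processors.
s = speedup(120.0, 20.0)
e = efficiency(120.0, 20.0, 8)

# Figures from the text: 10 GFLOPS required, at most 1 GFLOPS per processor.
n_ideal = min_processors(10.0, 1.0)          # ideal speedup assumed
n_at_75 = min_processors(10.0, 1.0, 0.75)    # assumed 75% efficiency

print(s, e, n_ideal, n_at_75)
```

Under the ideal-speedup assumption, ten processors would suffice for the quoted requirement; any loss of efficiency to communication or load imbalance raises that count.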

1.2  The  Assignment  Problem 

One  of  the  critical  BM/C3  tasks  is  the  assignment  of  weapons  to  targets. 
Situations  similar  to  the  problem  of  assigning  weapons  to  targets  frequently  occur 
in  other  areas  such  as  operations  research,  logistics  management,  and  even  in  a 
computer’s  internal  management  of  its  resources.  Typically,  there  exists  a  number 
of  resources  available  to  be  allocated  to  a  number  of  requesters.  In  most  cases, 
there  are  more  requesters  than  there  are  resources.  In  cases  such  as  these,  decisions 
must  be  made  as  to  which  requesters  are  allocated  resources  and  which  requesters  are 
denied  resources.  The  problem  is  generally  known  in  the  literature  as  the  assignment 
problem  and  usually  involves  allocating  available  resources  to  competing  requesters 
in  such  a  way  as  to  maximize  some  measure  of  profit  or  award,  or  to  minimize  some 
measure  of  penalty  [Kuh55,  Chu57,  Kur62]. 
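
As an illustration, the minimization form of the problem can be written as a 0-1 program. The symbols are assumptions introduced here rather than notation taken from the cited references: $c_{ij}$ is the penalty for allocating resource $i$ to requester $j$, there are $m$ resources and $n \ge m$ requesters, and $x_{ij} = 1$ when resource $i$ is allocated to requester $j$.

```latex
\begin{aligned}
\min_{x}\quad & \sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij}\, x_{ij} \\
\text{subject to}\quad & \sum_{j=1}^{n} x_{ij} = 1, && i = 1, \ldots, m, \\
& \sum_{i=1}^{m} x_{ij} \le 1, && j = 1, \ldots, n, \\
& x_{ij} \in \{0, 1\}.
\end{aligned}
```

The first constraint allocates every resource exactly once, and the second keeps any requester from receiving more than one resource, so $n - m$ requesters are denied. A profit-maximizing problem has the same form with each $c_{ij}$ negated.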

The  assignment  problem  can  be  solved  in  many  different  ways.  The  brute  force 
method  would  be  to  enumerate  all  the  possible  ways  resources  could  be  allocated  to 
requesters  and  then  choose  the  combination  that  provides  the  best  allocation.  This 
method might work well for a very small number of resources and requesters, but for
any realistically sized system, the time required to enumerate all of the possibilities
would be prohibitive. For example, if there were only 20 resources and 20 requesters,
the number of different resource-to-requester assignments would be 20!, or about
2.433 × 10^18 [Chu57]. This difficult problem has been recognized by mathematicians and
computer  scientists  who  have  developed  algorithms  that  provide  more  time  efficient 
methods  of  arriving  at  the  best,  or  very  close  to  the  best,  allocation  of  resources. 
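
A minimal sketch of the brute-force method follows, assuming a square cost matrix in which entry `cost[i][j]` is the penalty for giving resource `i` to requester `j`. The function name and the 3 x 3 matrix are hypothetical and serve only to illustrate the enumeration.

```python
from itertools import permutations
from math import factorial


def brute_force_assignment(cost):
    """Try every one-to-one allocation of an n x n cost matrix.

    Returns (best_cost, best_perm), where best_perm[i] is the requester
    given resource i. The loop runs n! times.
    """
    n = len(cost)
    best_cost, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if best_cost is None or total < best_cost:
            best_cost, best_perm = total, perm
    return best_cost, best_perm


# Hypothetical 3 x 3 cost matrix (3! = 6 candidate allocations).
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
best_cost, best_perm = brute_force_assignment(cost)
print(best_cost, best_perm)

# Number of candidate allocations for the 20 x 20 case discussed above.
print(factorial(20))
```

Three resources need only six evaluations, but the same loop for 20 resources would run 20! times, which is the growth that motivates algorithms such as the Hungarian method.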

Research  on  developing  algorithms  to  solve  the  assignment  problem  has  a  long 
history.  Von  Neumann,  who  is  considered  the  inventor  of  the  conventional  single¬ 
processor  computers  used  today,  experimented  with  the  computational  advantages 
of using linear programming techniques to solve the assignment problem [Kuh55].
Several others have also conducted research, developed algorithms, and devised software
implementations to achieve faster and more efficient methods of solving the
assignment problem [Mun57, Kur62, LaM69, SrT72, SrT73, Hat75, Hun83, McG83,
MaN86]. The techniques involved with many of these research efforts are similar and
involve  linear  programming,  graph  theory,  and  set  theory. 

1.3  Research  Objectives 

Although  algorithms  have  been  developed  to  solve  the  assignment  problem,  all 
of  them  have  been  implemented  as  sequential  processes.  Because  these  algorithms 
are  sequential  in  nature,  they  are  easily  implemented  on  sequential,  single-processor 
computers.  Unfortunately,  algorithms  that  solve  the  assignment  problem  in  a  parallel 
processing  environment  have  not  yet  been  developed.  Given  the  potential  speedups 
possible  with  parallel  computers,  it  would  seem  advantageous  for  the  battle  manage¬ 
ment  portion  of  the  SDI  system  to  use  a  parallelized  version  of  one  of  these  sequen¬ 
tial  algorithms  to  perform  the  weapon-target  assignment  task.  This  research  first 
investigates the techniques for mapping algorithms onto parallel processors. Then,
sequential assignment algorithms are analyzed to select a candidate for parallelization.

The primary objective of this thesis investigation is the implementation of assignment
algorithms on a parallel multiprocessor computer. After successful implementation,
the performance of the parallel algorithms is analyzed. In this analysis,
particular  attention  is  focused  on  the  effects  of  inter-processor  communications,  load 
balancing  among  processors,  execution  times,  and  machine  size  to  problem  size  re¬ 
lationships.  The  parallel  computer  used  for  the  implementations  is  the  Intel  iPSC 
(Intel  Personal  Super  Computer)  multiprocessor  system  which  is  described  in  detail 
in  Chapter  2. 

1.4  Scope 

In  this  study,  the  problem  of  assigning  weapons  to  targets  in  a  parallel  pro¬ 
cessing  environment  is  the  primary  focus.  For  this  reason,  exact  details  of  the  battle 
management system such as how the individual targets are detected and tracked;
the specifics of particular weapons; the operation and sensitivity of sensor devices;
and the three-dimensional and rotational characteristics of weapon-to-target geometry
are not addressed. These factors are accounted for to a certain degree by using
techniques described in Section 1.5 (Assumptions). However, the concepts that are
explored  in  this  study  should  contribute  to  the  research  and  development  of  future 
battle  management  systems.  The  specific  steps  of  this  research  are  as  follows: 

1.  First,  techniques  for  partitioning  and  mapping  sequential  algorithms  onto 
parallel  computer  architectures  are  researched.  From  the  candidate  techniques,  one 
is  chosen  that  best  matches  the  loosely  coupled  architecture  of  the  Intel  hypercube. 

2.  The  study  continues  by  locating  efficient  sequential  algorithms  that  solve 
the  assignment  problem.  These  algorithms  are  evaluated  to  determine  which  ones, 
if any, lend themselves to parallel implementation.

3.  Using the chosen algorithm and mapping technique, weapon-target assignment
programs are designed and then implemented using the parallel programming
language supported by the Intel iPSC. A top-down, structured approach to
software  development  is  used  to  minimize  the  time  required  for  implementation. 

4.  The  implementations  are  tested  on  the  Intel  hypercube  machine  using  differ¬ 
ent  numbers  of  processors  and  varying  processor  configurations.  The  performance  is 
measured  with  varying  numbers  of  weapons  and  potential  targets  to  generate  ample 
data  for  analysis  and  comparison  of  the  different  implementations. 

1.5  Assumptions 

A  number  of  simplifying  assumptions  were  necessary  in  order  to  both  limit 
the  detail  of  the  research  to  a  reasonable  level  and  still  allow  time  for  completion. 
First,  the  number  and  location  of  potential  targets,  along  with  their  relative  impor¬ 
tance,  are  assumed  to  be  available  on  demand.  Likewise,  the  number  and  status  of 
available  resources  or  weapons  are  also  assumed  to  be  immediately  available  when 
requested.  Problems  associated  with  detecting  and  classifying  potential  targets,  and 
the  details  of  evaluating  the  effectiveness  of  weapons  already  assigned  to  targets  are 
not  considered,  although  simulated  results  of  those  functions  are  supplied  as  input 
data to the programs. Weapons are considered to be reusable with a finite number
of  “shots,”  and  are  assignable  to  one  target  at  a  time  for  a  single  “shot.”  Each 
instance  of  assignment  is  assumed  to  be  one  “snapshot”  of  the  dynamic  process  of 
missiles  in  some  phase  of  their  trajectory. 


Because  the  main  focus  of  this  study  is  on  the  implementation  of  a  paral¬ 
lel  weapon-target  assignment  algorithm,  an  entirely  realistic  simulation  of  missile 
trajectories  and  distribution  patterns  of  missiles  within  the  different  regions  is  not 
attempted.  However,  plausible  missile  attack  scenarios  are  generated  by  an  unclas¬ 
sified  ballistic  missile  defense  simulation  program.  These  scenarios  are  used  as  a 
basis  for  constructing  similar  data  as  input  to  the  implementations  developed  in  this 
study. Factors such as space-based weapons platform orbits, rotation of the earth,
and plausible missile trajectories originating from locations in the Soviet Union are
accounted for in the simulation program [Odo85].


1.6  Overview  of  the  Thesis 

This chapter has presented a brief overview of the SDI, parallel computing, and
general assignment problems. The objectives of this research were presented along
with the scope, assumptions, and the general approach to be taken to reach the
stated objectives. The remainder of this thesis develops in detail the steps listed in
Section 1.4. Chapter 2 begins with a brief survey of the different types of parallel
computers and then uses the survey as a basis for describing the Intel iPSC parallel
computer. It continues with an investigation of the techniques for developing parallel
software implementations for the Intel iPSC and concludes with a summary of the
techniques selected for use. Chapter 3 surveys the assignment
algorithms developed in the past three decades. Then, development of
the  parallel  assignment  algorithm  begins  by  using  the  techniques  selected  in  Chap¬ 
ter  2  and  any  useable  portion  of  the  assignment  algorithms  developed  by  others  in 
the  past.  Chapter  4  begins  with  a  detailed  definition  of  the  experimental  model, 
a  presentation  of  the  ballistic  missile  defense  (BMD)  simulation  program  and  a  de¬ 
scription  of  a  method  for  generating  target  scenario  data  using  the  BMD  simulator 
as  a  basis.  Chapter  4  continues  with  a  description  of  the  different  implementations 
of  the  parallel  assignment  algorithm.  In  Chapter  5,  the  method  of  testing  and  data 
acquisition is explained first, followed by a presentation of the results obtained from
performance runs on the Intel iPSC. These results are then analyzed in detail.
Chapter 6 ends the thesis with conclusions and
recommendations for further study.




2.  Parallel  Processing  Background 


Chapter  1  briefly  introduced  the  subject  of  parallel  processing.  This  chapter 
continues  with  a  more  in-depth  discussion  of  parallel  processing  by  first  surveying 
the different types of parallel-processor architectures and then focusing on a particular
class of architecture known as Multiple Instruction-stream, Multiple Data-stream
(MIMD).  Then  the  history  of  development,  the  hardware,  and  some  of  the  important 
features  of  the  Intel  iPSC  hypercube  computer  are  all  presented.  Techniques  for  map¬ 
ping  problem  solutions  onto  parallel  processor  architectures  are  then  investigated, 
followed  by  a  discussion  of  the  problems  associated  with  parallel  algorithm  imple¬ 
mentations.  This  chapter  concludes  with  a  presentation  of  recent  implementations 
by  others  on  MIMD  parallel-processor  computers. 

2.1  Parallel  Processor  Architectures 

The  Von  Neumann  machine  is  a  sequential  computer  consisting  of  a  central 
processing  unit  (CPU),  a  memory  system,  and  an  input/output  (I/O)  system.  In¬ 
structions  are  accessed  from  the  memory  system  and  executed  in  the  CPU  one  at  a 
time.  This  Von  Neumann  model  of  a  sequential  computer  is  the  underlying  architec¬ 
ture  of  a  majority  of  the  conventional  computers  available  today  [EbB86].  Steady 
improvements  in  VLSI  technology  have  allowed  this  sequential  architecture  to  remain 
popular by reducing the signal propagation delays, discussed in Chapter 1, between
the CPU and memory [EbB86]. However, reducing signal propagation delays is becoming
increasingly difficult because the physical limits of signal transmission
speed in silicon, the most common fabrication material, are being approached [Den86].
This is one motivation behind the development of parallel computer architectures to
achieve  faster  processing  times  instead  of  attempting  to  speed  up  sequential  Von 
Neumann  computers. 

A processing element (PE) can be basically defined as a CPU and a local memory
unit for storing programs and local data. Parallel computer architectures utilize
a  number  of  processing  elements,  usually  in  the  form  of  Von  Neumann  machines, 
that  are  linked  together  by  an  interconnection  network.  This  interconnection  net¬ 
work  provides  a  means  to  either  transfer  information  between  the  different  processing 
elements  or  to  allow  access  to  a  common  data  storage  area.  The  following  sections 
discuss  the  different  types  of  parallel  architectures. 

2.1.1  Flynn’s Classification of Architectures  Flynn classified computer architectures
into four categories according to the number of instruction and data
streams utilized [Fly66]. Those categories are Single Instruction-stream Single Data-stream
(SISD), Single Instruction-stream Multiple Data-stream (SIMD), Multiple
Instruction-stream Single Data-stream (MISD), and Multiple Instruction-stream Multiple
Data-stream (MIMD). The SISD category describes the sequential Von Neumann
machines. MISD is generally regarded as an impractical classification of a computer
architecture  [HwB84].  The  SIMD  and  MIMD  categories  describe  the  architectures  of 
parallel  computers.  Representations  of  these  classifications  are  shown  in  Figure  2-1. 

SIMD machines are generally composed of a number of simple processing elements
statically linked to a central control unit that interprets instructions and
issues  commands  to  the  processing  elements.  Processing  in  parallel  SIMD  machines 
is  usually  characterized  by  identical  operations  simultaneously  performed  in  lock  step 
on  each  element  of  an  array  or  matrix.  The  Illiac  IV,  one  of  the  first  SIMD  machines 
developed  in  the  1960’s,  was  used  to  solve  problems  in  areas  such  as  fluid  flow,  aero¬ 
dynamics,  and  meteorology  [RiS84].  A  recently  introduced  SIMD  computer  is  the 
Connection Machine, which employs 65,536 simple processors [Hil87].

In  contrast  to  SIMD  machines,  the  individual  processors  in  MIMD  machines  do 
not  necessarily  perform  the  same  instructions  at  the  same  time.  Processing  elements 
are  relatively  independent  and  each  one  may  be  executing  a  completely  different 
program.  Different  types  of  MIMD  architectures  will  be  discussed  in  the  next  section. 

Figure 2-1. Flynn’s Classifications (a) SISD (b) MISD (c) SIMD (d) MIMD

2.1.2  Types of MIMD Architectures  The MIMD classification of a computer
architecture can be further divided into two sub-classes, based on the memory structure
and the type of interprocessor communications. One sub-class is the shared-memory
machine, where all individual processing elements have access to a large
global  memory  which  is  used  to  access  common  data  and  to  pass  information  be¬ 
tween  processors.  Shared-memory  machines  are  also  known  as  tightly  coupled  pro¬ 
cessors  because  of  the  degree  of  interaction  between  processors  imposed  by  the  global 
memory  [HwB84]. 

Another  sub-class  of  MIMD  computers  is  the  local  memory  or  loosely-coupled 
machine.  Processors  in  loosely-coupled  machines  each  possess  their  own  private 
memory  that  is  not  accessible  by  the  other  processors.  Information  is  exchanged 
between processors by passing messages through the interconnection network. Processors
in these machines are generally more independent than those in the shared-memory
machines. Loosely-coupled machines derive their name from the reduced
interaction between the individual processors [MuA87]. Many of the commercial
MIMD computers available today are loosely coupled [HwB84].
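
The loosely-coupled model can be sketched in a few lines. The example below is only a simulation under stated assumptions: Python threads stand in for processors, one queue per node stands in for the interconnection network, and the names `node` and `run_host` are hypothetical. It does not use or imitate any vendor library. Each node works only on the data it receives in a message and returns its result in another message.

```python
import queue
import threading


def node(node_id, inbox, host_inbox):
    """A node computes on private data and communicates only by messages."""
    local_data = inbox.get()              # receive this node's share of the work
    partial = sum(local_data)             # compute using local data only
    host_inbox.put((node_id, partial))    # send the result back as a message


def run_host(data, n_nodes):
    """Scatter slices of `data` to the nodes and gather their partial sums."""
    host_inbox = queue.Queue()
    inboxes = [queue.Queue() for _ in range(n_nodes)]
    workers = [
        threading.Thread(target=node, args=(i, inboxes[i], host_inbox))
        for i in range(n_nodes)
    ]
    for w in workers:
        w.start()
    for i, inbox in enumerate(inboxes):
        inbox.put(data[i::n_nodes])       # every n_nodes-th item goes to node i
    results = dict(host_inbox.get() for _ in range(n_nodes))
    for w in workers:
        w.join()
    return results


partials = run_host(list(range(1, 101)), 4)
total = sum(partials.values())
print(partials, total)
```

The order in which partial results arrive is not fixed, which is why each message carries the sender's identifier.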

2.1.3  The  Hypercube  Interconnection  Network  Parallel  solutions  to  certain 
problems  sometimes  require  the  processors  to  be  configured  into  a  ring,  mesh,  star, 
or tree structure [Fen81]. There are a number of ways to interconnect the processors
in  an  MIMD  multiprocessor  computer.  One  class  of  interconnection  network  that 
can  function  as  any  of  the  listed  configurations  is  based  on  the  cube  interconnection 
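function [Sei85]. The m cube functions can be defined as:

\[
\mathrm{cube}_i(p_{m-1} \ldots p_{i+1}\, p_i\, p_{i-1} \ldots p_1\, p_0) = p_{m-1} \ldots p_{i+1}\, \bar{p}_i\, p_{i-1} \ldots p_1\, p_0 \tag{2-1}
\]

where $0 \le i < m$ and $\bar{p}_i$ denotes the complement of $p_i$.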


The cube function is the basis for networks such as the multistage cube network
[McS85], the Boolean n-cube [Pea77], and the hypercube [SaS85]. The hypercube
interconnection scheme has the advantage that if the total number of processors is
N, the maximum number of intermediate links that must be traversed by a message
from one processor in order to communicate with any other processor in the network
is $\log_2 N$.

The  processors  in  a  hypercube  interconnection  network  are  linked  together 
based  on  the  binary  representation  of  the  processor’s  address.  Processors  whose 
binary addresses differ by only one bit (i.e., are related by the cube function $\mathrm{cube}_i$ for some bit i) are
connected. For example, in a three-dimensional cube there are $8 = 2^3$ processors.
These  binary  addresses  can  be  represented  as  shown  in  Table  2-1. 


Table 2-1.  Processor Binary Addresses

Processor   Address
    0         000
    1         001
    2         010
    3         011
    4         100
    5         101
    6         110
    7         111

Each  processor  is  connected  to  three  other  processors  using  this  scheme.  The 
resulting  structure  can  be  represented  by  the  diagram  illustrated  in  Figure  2-2  where 
the  labeled  nodes  represent  processors  and  the  lines  represent  the  links  between  the 
processors.  Different  dimension  hypercubes  can  be  formed  by  following  the  same 
addressing  scheme. 
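
The addressing scheme can be sketched as follows, assuming processor addresses are held as ordinary integers. The function names are hypothetical, and the `route` function shows just one possible shortest path (correcting differing bits from lowest to highest); it is not a description of how any particular machine routes messages.

```python
def cube(i, address):
    """Cube function for bit i: complement bit i of a processor address."""
    return address ^ (1 << i)


def neighbors(address, dimension):
    """Addresses directly linked to `address` in a hypercube of this dimension."""
    return [cube(i, address) for i in range(dimension)]


def hops(src, dst):
    """Minimum number of links between two processors (differing address bits)."""
    return bin(src ^ dst).count("1")


def route(src, dst, dimension):
    """One shortest path from src to dst, fixing differing bits low to high."""
    path = [src]
    current = src
    for i in range(dimension):
        if (current ^ dst) & (1 << i):
            current = cube(i, current)
            path.append(current)
    return path


d = 3                      # three-dimensional cube, 2**3 = 8 processors
n = 2 ** d
print(neighbors(0b000, d))
print(neighbors(0b101, d))
print(route(0b000, 0b111, d))

# Largest hop count over all processor pairs equals the cube dimension.
diameter = max(hops(a, b) for a in range(n) for b in range(n))
print(diameter)
```

For the eight processors of Table 2-1, every processor has three neighbors and no message needs more than three links, in agreement with the log2 N bound stated earlier.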


Figure 2-2.  Three-Dimension Cube Structure


2.2  The Intel iPSC Hypercube Computer


The Intel iPSC hypercube computer is used for implementing the parallel assignment
algorithms developed in Chapters 3 and 4 of this study. An overview of the
Intel  machine  and  some  of  its  important  features  is  necessary  in  order  to  understand 
some  of  the  decisions  that  are  made  during  development  of  the  implementations. 

2.2.1  History  The  origin  of  the  Intel  iPSC  can  be  traced  back  to  research 
performed  at  Caltech  and  the  NASA  Jet  Propulsion  Laboratory  during  1978-1981 
[Sei85]. This research formed the basis for an MIMD, local memory, multiprocessor
machine that was designed and built primarily as a hardware simulation of a
computer  researchers  expect  to  be  able  to  implement  entirely  in  VLSI  in  the  future. 
However,  the  excellent  performance  of  the  prototype  prompted  Seitz  and  his  col¬ 
leagues  to  experiment  with  solving  a  variety  of  computationally-intensive  problems. 
They  nick-named  this  new  machine  the  Cosmic  Cube  [Sei85].  The  Cosmic  Cube  was 
later  developed  into  a  commercial  computer  system  named  the  iPSC  (Intel  Personal 
Super  Computer)  by  the  Intel  Corporation.  Customer  shipments  of  the  iPSC  began 
in  February  1985  [Den86].  The  iPSC  is  now  available  to  researchers  at  many  centers, 
including  the  Air  Force  Institute  of  Technology  (AFIT). 


2.2.2  Hardware Organization  The Intel iPSC is available in several configurations
ranging from a 16-processor, 4-dimension cube up to a 128-processor, 7-dimension
cube. In the basic configuration, each processor is built around an Intel
80286 microprocessor, an 80287 numeric coprocessor, and 512K bytes of random access
memory (RAM) [Int86]. Options such as additional memory and vector-processing
capabilities  can  make  the  iPSC  a  very  powerful  machine  for  a  modest  cost  when 
compared  to  large  supercomputers  such  as  the  well  known  and  expensive  Cray  se¬ 
ries. 

The  individual  processing  elements  or  nodes  are  interconnected  in  a  hypercube 
topology, with communications coprocessors handling the processor-to-processor message
passing duties. The user develops applications for and communicates with processors
in the cube through an intermediate host known as the cube manager. The
cube  manager  is  also  built  around  the  80286  microprocessor  and  80287  coprocessor, 
but has additional memory capacity [Int86].
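
The hypercube interconnection described above can be illustrated with a short sketch. It assumes the common convention, which is not stated in the text, that the nodes of a d-dimension cube are labeled with the integers 0 through 2^d - 1 and that two nodes are directly linked when their binary labels differ in exactly one bit. The function names are hypothetical and are not part of the Intel iPSC library.

```python
def hypercube_neighbors(node: int, dimension: int) -> list[int]:
    """Return the labels of the nodes directly linked to `node`.

    Assumes nodes are labeled 0 .. 2**dimension - 1 (an illustrative
    convention, not taken from the iPSC documentation).
    """
    if not 0 <= node < 2 ** dimension:
        raise ValueError("node label out of range for this cube")
    # Flipping bit k of the label gives the neighbor along dimension k.
    return [node ^ (1 << k) for k in range(dimension)]


def hop_distance(a: int, b: int) -> int:
    """Minimum number of links a message must cross between nodes a and b."""
    # Each differing bit requires one hop along the matching dimension.
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    # A 16-processor, 4-dimension cube: every node has four neighbors.
    print(hypercube_neighbors(5, 4))
    print(hop_distance(0, 15))
```

Under this labeling, every node in a d-dimension cube has d direct neighbors, and no two nodes are more than d links apart.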

2.2.3  Software  Development  Environment  The  software  development  envi¬ 
ronment  of  the  iPSC  is  based  on  a  derivative  of  the  UNIX  operating  system  known 
as the XENIX environment [Int86]. The languages supported are parallel versions
of  FORTRAN,  C,  and  Lisp.  Applications  are  developed  using  the  cube  manager  as 
a  means  to  compile,  debug,  and  run  programs  written  in  these  modified  languages. 
Predefined  library  functions  are  used  to  perform  operations  such  as  opening  commu¬ 
nications  channels  between  the  cube  manager  or  other  processors,  sending  or  receiv¬ 
ing  either  synchronous  or  asynchronous  messages  from  other  processors,  controlling 
processes  running  on  processors  in  the  cube,  and  many  other  functions  unique  to  the 
Intel hypercube. A program to simulate the functions of the iPSC hypercube for use
in  initial  program  development  is  available  for  other  systems  running  the  BSD  4.2 
UNIX  operating  system.  However,  accurate  performance  data  for  applications  must 
be  obtained  using  the  actual  Intel  iPSC  machine. 


2.3  MIMD Mapping Techniques


Much  has  been  written  about  techniques  for  developing  applications  software 
for  parallel  MIMD  computers.  These  techniques  are  sometimes  known  as  mapping 
techniques  [Sei85,  Fox84].  The  most  important  mapping  techniques  and  some  areas 
to  be  concerned  with  while  developing  implementations  are  covered  in  this  section. 

2.3.1  The  Basic  Approach  Many  science  and  engineering  problems  are  nat¬ 
urally  divided  into  concurrent  processes  [Sei85].  If  they  are  relatively  independent, 
either one or several of these processes can be assigned to separate nodes or processors
in a parallel computer for concurrent processing. Then the “intriguing and
... amusing” task of coordinating the computing activities in each processor must be
addressed [Sei85]. Continuing, Seitz says from experience that application formulation
for the multiprocessor Cosmic Cube “has not proved to be very much more difficult
than it is on sequential [single processor] machines.” In many cases, he says parallel
applications  are  based  on  adaptations  of  well  known  sequential  algorithms. 

Fox and Otto maintain that “the main stumbling block to the use of concurrent
processors is the difficulty of formulating algorithms and programs for them.” They
go on to say “that concurrent processors are quite easy to use and ... address the
vast majority of computationally intensive problems.” They agree with Seitz when
they say that most computationally demanding problems are not solved by using
complex algorithms, but “rather there is a relatively simple procedure ... that one
must apply to a basic unit ... in a world that consists of a huge number of such
units.”


2.3.2  Communications  Overhead  One  of  the  most  common  problems  asso¬ 
ciated  with  applications  for  parallel  processing  is  the  minimization  of  the  commu¬ 
nications  overhead.  The  ratio  of  communications  to  computations  should  be  ap¬ 
proximately  one  (unity)  [Fox84].  This  means  that  the  amount  of  communications 
should not be greater than the computations. An example of communications overhead
is when information about a problem subdomain contained in one processor
is  needed  by  another  processor  working  on  a  different  subdomain.  Exchange  of  this 
information  requires  that  these  processors  communicate  with  each  other  using  the 
interconnection  network.  This  type  of  communication  between  processors  must  be 
kept as low as possible [Fox84]. “An important measure of an algorithm’s efficiency
... [is] ... the time to move the data” [HoZ83]. This “time to move the data” referred
to by Horowitz and Zorat is the communications overhead.

Saltz  says  there  are  several  techniques  that  can  be  used  to  reduce  the  commu¬ 
nications  overhead.  One  technique  is  to  reduce  the  quantity  of  information  to  be 
communicated  by  only  sending  information  that  is  absolutely  necessary.  Another 
method is to reduce the frequency of communications by sending several bits of information
in each message [SaN85]. Saltz mentions one other method that involves
overlapping  communications  with  processing,  which  can  be  accomplished  by  using 
asynchronous  message-passing  library  functions  in  the  Intel  iPSC  programming  en¬ 
vironment.  Another  technique  for  reducing  communications  overhead,  related  to 
problem  partitioning,  involves  increasing  the  size  of  the  subdomain  assigned  to  each 
processor.  This  absorbs  some  of  the  communications  that  would  have  been  necessary, 
but  also  reduces  the  level  of  parallelism  [SaN85]. 
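
The benefit of sending fewer, larger messages can be sketched with a simple cost model. The model and its numbers are assumptions made only for illustration: each message is taken to cost a fixed startup time plus a time proportional to the number of values it carries. Neither the parameter values nor the function name come from the cited papers or from the iPSC documentation.

```python
def transfer_time(n_values: int, values_per_message: int,
                  startup: float = 1.0, per_value: float = 0.01) -> float:
    """Estimated time to send n_values, packed values_per_message at a time.

    startup   -- assumed fixed cost paid once per message (arbitrary units)
    per_value -- assumed cost per value transferred (arbitrary units)
    """
    if n_values < 0 or values_per_message < 1:
        raise ValueError("need n_values >= 0 and values_per_message >= 1")
    # Ceiling division: the last message may be only partly filled.
    messages = -(-n_values // values_per_message)
    return messages * startup + n_values * per_value


if __name__ == "__main__":
    for batch in (1, 10, 50):
        print(batch, transfer_time(100, batch))
```

In this model the per-value cost is the same however the data is packed, so the saving comes entirely from paying the startup cost fewer times.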

2.3.3  Problem  Partitioning  According  to  Fox  and  Otto,  the  first  step  in  for¬ 
mulating  a  solution  to  large  problems  on  a  concurrent  processor  is  to  partition  the 
problem  into  many  parts  and  assign  a  different  part  of  the  problem  to  each  indi¬ 
vidual  node  or  processor.  Part  of  the  difficulty  with  partitioning  the  large  problem 
is  deciding  on  the  size  of  the  subproblems.  If  the  subproblems  are  too  small,  there 
is  a  chance  that  excessive  communications  between  processors  will  be  necessary  to 
complete the solution [Fox84]. On the other hand, they say forming larger subproblems
tends to reduce the communications overhead and increase the efficiency of the
computations. 

Cvetanovic has written a paper discussing the effects of problem partitioning
and granularity on multiprocessor performance [Cve87]. She says that as the size of
computations  performed  on  the  separate  processors  decreases,  the  amount  of  parallel 
computations  increases.  But  because  of  the  increased  parallelism,  computations  are 
performed  faster  and  more  requests  for  additional  data  or  communications  with  other 
processors  are  initiated.  As  the  communications  increase,  the  overall  processing  slows 
down.  According  to  Cvetanovic,  the  following  parameters  are  likely  to  have  the  most 
significant  effects  on  multiprocessor  performance: 

•  The  amount  of  parallelism  inherent  in  the  application  of  the  problem. 

•  The  method  for  decomposing  a  problem  into  smaller  subproblems. 

•  The  method  applied  to  allocate  these  subproblems  to  processors. 

•  The  grain  size  of  a  subproblem  executed  on  each  processor. 

Cvetanovic  concludes  that  problem  partitioning  has  a  strong  effect  on  multipro¬ 
cessor  performance.  If  the  subproblem  size  introduces  unacceptable  communications 
overhead,  she  suggests  two  methods  for  reducing  this  overhead.  The  first  method  is 
to  increase  the  capabilities  of  the  interprocessor  communications  network.  This  is 
seldom  possible,  so  the  second  method  she  suggests  is  more  promising.  It  involves 
increasing  the  subproblem  size  in  order  to  transform  some  interprocessor  communi¬ 
cations  into  intraprocessor  communications.  This  transformation  effectively  reduces 
the  demands  on  the  communications  network  and  increases  overall  performance. 
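
The effect of grain size on the communication-to-computation ratio can be sketched for one simple case. The example assumes a square grid of points divided into equal square blocks, one block per processor, with one unit of work per grid point and one value exchanged per point on the block boundary. This nearest-neighbor grid model is an illustrative assumption and is not taken from Cvetanovic's or Fox's papers.

```python
import math


def comm_to_comp_ratio(grid_side: int, processors: int) -> float:
    """Ratio of boundary values exchanged to grid points computed per block.

    Assumes a grid_side x grid_side domain split into `processors` equal
    square blocks (so `processors` must be a perfect square that divides
    the grid evenly).
    """
    blocks_per_side = math.isqrt(processors)
    if blocks_per_side ** 2 != processors or grid_side % blocks_per_side:
        raise ValueError("grid cannot be split into equal square blocks")
    block_side = grid_side // blocks_per_side
    computation = block_side ** 2      # one unit of work per interior point
    communication = 4 * block_side     # one value per point on each of 4 edges
    return communication / computation


if __name__ == "__main__":
    for p in (4, 16, 64):
        print(p, comm_to_comp_ratio(64, p))
```

Under these assumptions the ratio is 4 divided by the block side length, so larger subproblems exchange less data per unit of work, at the price of using fewer processors.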

2.3.4  Load  Balancing  Another  factor  to  consider  in  partitioning  a  problem 
is “load balancing” [FeK85, Fox84]. Efficiency is increased if all processors are
performing  essentially  the  same  computations.  The  general  idea  is  that  the  amount 
of communications between processors is not as important as the “amount of computation
done per communication” [Fox84]. Fox says that memory requirements per
processor  must  be  equal  and  fixed  in  order  to  ensure  the  efficiency  of  the  implemen¬ 
tation  will  not  depend  on  the  number  of  processors  in  the  machine.  This  restriction 
achieves load balancing of the processors by ensuring that no one processor will
perform  the  bulk  of  the  computations  [Fox84]. 

2.3.5  Use of Sequential Algorithms  On the use of sequential algorithms in
parallel implementations, Fox says that each processor performs essentially the same
computations a single processor computer would perform. The difference is that the
computations are performed on a subdomain of the overall problem. He says the
development of programs to run on the individual processors of a multiprocessor
computer  should  be  very  similar  to  those  used  in  a  uniprocessor  machine.  An  excep¬ 
tion  to  this  similarity  occurs  when  “boundary  conditions”  must  be  considered  where 
the  problem  domains  of  programs  running  in  different  processors  overlap.  In  cases 
such  as  these,  interprocessor  communications  and  some  type  of  synchronization  must 
occur in order to complete the solution, which in turn reduces the efficiency of the
processing.

In  some  cases,  the  adaptation  of  a  sequential  algorithm  into  a  parallel  algorithm 
introduces  other  overheads  in  addition  to  the  communications  overhead.  These  addi¬ 
tional  overheads  may  involve  “housekeeping  chores”  and  imply  that  not  all  sequential 
algorithms are adaptable to parallel implementations [Cve87]. Also, sequential algorithms
may not expose all the parallelism present in the problem [HaL82].


2.4  Other Implementations

As  noted  in  the  introduction,  there  have  been  several  software  implementa¬ 
tions  recently  developed  for  parallel  computers.  This  section  briefly  describes  some 
of these implementations that use loosely-coupled MIMD computers like the Intel
iPSC.  Many  applications  have  been  developed  at  Caltech  and  NASA  JPL  in  a  wide 
range  of  problem  areas  such  as  high  energy  physics,  fluid  flow,  astrophysics,  image 
processing, chemistry, structural mechanics, and other areas [Fox84, Sei85, Fox81].
These applications are too numerous to describe here, but a few select applications
are  described,  along  with  applications  developed  by  researchers  at  other  institutions. 

2.4.1  Boost-Phase Track Initiation Algorithms  One implementation closely
related  to  the  ones  that  will  be  developed  in  this  thesis  was  developed  by  Gottschalk 
at  Caltech.  His  implementation  was  developed  on  a  version  of  their  Cosmic  Cube 
which  is  similar  to  the  Intel  iPSC.  The  problem  involved  determining  the  tracking 
of  ballistic  missiles  in  the  boost  phase  by  selecting  the  likely  missile  tracks  and 
eliminating  the  unlikely  or  redundant  tracks.  His  solution  method  used  sequential 
algorithms  in  each  node  of  the  hypercube  with  the  number  of  nodes  a  factor  of 
4  less  than  the  number  of  targets  per  track.  Significant  speedups  over  sequential 
implementations  of  the  same  Kalman  filter  technique  used  in  the  parallel  version 
were achieved [Got87].

2.4.2  Parallel Branch and Bound  Mraz developed two implementations of a
parallel branch-and-bound algorithm for the Intel iPSC. He solved an N-queens problem
and a deadline job scheduling problem using the branch-and-bound technique.
His method used a tree structure embedded into the hypercube interconnection network
to search the problem solution space [Mra86]. He reported
speedups over sequential implementations of similar algorithms; however, for small
problem sizes, the sequential implementation performed better. This appeared to be
caused by several factors, one of which involved problem partition size. Another factor
was related to synchronization of the tasks within the hypercube, which reduced
the amount of parallelism achievable. As the size of the problems was increased, the
speedup and efficiency of the parallel implementations showed good improvement.
The results of Mraz's thesis point out the important effects problem partitioning and
processor communications have on the overall performance.

2.4.3  The Traveling Salesman Problem  This implementation was also developed
at Caltech on one of their Cosmic Cube parallel computers. The traveling salesman
problem is a classic optimization problem that has applications in areas such
as circuit layout, VLSI design, resource allocation, and logistical problems [FeK85].
The basic problem is to find the shortest tour for a traveling salesman who must, for
the least cost, visit each of a number of cities only once. The solution space of this problem
grows factorially as the number of cities is increased linearly because of the number
of possible routes the salesman could take. The solution method utilized by Felten
and  his  associates  was  a  statistical  mechanics  technique  known  as  simulated  anneal¬ 
ing.  A  mesh  structure  was  embedded  into  the  hypercube  network  in  order  to  match 
the  structure  of  the  simulated  annealing  algorithm.  This  implementation  exhibited 
speedups  over  sequential  implementations  ranging  from  1.92  using  two  processors  to 
54.92  using  sixty-four  processors.  These  speedups  are  not  ideal,  but  represent  signifi¬ 
cant  reductions  in  the  processing  times  required  to  solve  this  important  optimization 
problem. 
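
The statement that these speedups are not ideal can be made concrete by computing parallel efficiency. The sketch below assumes the usual definition, efficiency equals speedup divided by the number of processors, which is not spelled out in this section. The two speedup figures are the ones quoted above.

```python
def efficiency(speedup: float, processors: int) -> float:
    """Parallel efficiency, assuming efficiency = speedup / processors."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return speedup / processors


if __name__ == "__main__":
    # Speedups quoted in the text for the simulated annealing implementation.
    reported = {2: 1.92, 64: 54.92}
    for processors, speedup in reported.items():
        print(f"{processors} processors: efficiency "
              f"{efficiency(speedup, processors):.2f}")
```

An ideal implementation would have an efficiency of 1.0 at every machine size. Under the assumed definition, the quoted figures correspond to roughly 0.96 on two processors and 0.86 on sixty-four.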

2.4.4  Gaussian Elimination  Gaussian elimination is a computationally intensive
method used to solve dense linear systems that requires manipulations of
the  rows  and  columns  of  large  matrices  [Saa86].  Saad  examined  several  methods  of 
mapping  solutions  to  this  problem  onto  the  Intel  iPSC  computer.  He  found  through 
computational  experiments  that  this  particular  problem  was  best  solved  using  a  grid 
structure embedded into the hypercube network. The use of a pipelining technique
combined  with  the  grid  algorithms  produced  the  lowest  amount  of  communications 
between  processors,  which  was  pointed  out  in  Section  2.3.2  as  the  most  important 
overhead  to  reduce. 

2.4.5  Vision Algorithms  A hypercube implementation that applies image
processing  techniques  to  printed  circuit  inspection  was  accomplished  by  Mudge  and 
Abdel-Rahman. They used a Gray-code scheme similar to a Karnaugh map to partition
and assign portions of an image to separate processors in a 128-processor
NCUBE hypercube computer [MuA87]. Their problem was to process the image of
a printed circuit under inspection in order to extract certain features and compare
them  with  a  template  of  the  correct  image.  If  discrepancies  were  detected  between  the 
template  and  the  image  under  inspection,  the  printed  circuit  was  rejected  as  faulty. 
The  solution  of  this  problem  required  the  processing  of  approximately  10  Mbytes  of 
data  in  a  few  seconds.  Although  they  used  a  128-processor  machine  to  obtain  their 
experimental results, they predicted that 40 frames of 512 x 512 1-byte images could
be  completely  processed  in  less  than  three  seconds  using  a  1024-processor  version  of 
the  same  NCUBE  computer.  Two  problems  they  encountered  were  computational 
overheads  in  algorithms  and  in  communications,  which  were  cited  in  Section  2.3  as 
potential  problems  with  parallel  implementations. 
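
The idea behind a Gray-code mapping can be sketched briefly. The example uses the binary-reflected Gray code, in which consecutive codes differ in exactly one bit, so consecutive positions in a ring (or along one axis of a mesh or grid) land on directly linked hypercube nodes. This is a generic illustration under the assumption that nodes are labeled by binary integers. It is not a reconstruction of the specific scheme used by Mudge and Abdel-Rahman.

```python
def gray(i: int) -> int:
    """Binary-reflected Gray code of a non-negative integer i."""
    return i ^ (i >> 1)


def ring_to_hypercube(dimension: int) -> list[int]:
    """Hypercube node label assigned to each position of a 2**dimension ring."""
    return [gray(i) for i in range(2 ** dimension)]


def differ_in_one_bit(a: int, b: int) -> bool:
    """True when labels a and b are directly linked hypercube nodes."""
    return bin(a ^ b).count("1") == 1


if __name__ == "__main__":
    labels = ring_to_hypercube(3)
    print(labels)
    # Every ring neighbor pair, including the wrap-around pair, is one link apart.
    print(all(differ_in_one_bit(labels[i], labels[(i + 1) % len(labels)])
              for i in range(len(labels))))
```

A two-dimensional mesh can be mapped the same way by applying a Gray code to the row index and the column index separately and concatenating the two bit fields.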


2.5  Summary 


This  chapter  presented  a  brief  discussion  of  parallel-processor  architectures  and 
an  overview  of  Intel's  iPSC  MIMD  computer.  Techniques  for  developing  applications 
for  machines  similar  to  the  Intel  iPSC  were  discussed,  with  particular  emphasis 
on  problem  partitioning  and  interprocessor  communications.  A  few  of  the  many 
recent  implementations  on  parallel  MIMD  computers  were  presented  and  some  of 
the  problems  associated  with  those  implementations  were  noted.  Chapter  3  begins 
the  process  of  developing  parallel  weapon-target  assignment  algorithms  by  examining 
sequential  assignment  algorithms.  Mapping  techniques  introduced  in  this  chapter  are 
expanded  for  use  in  developing  parallel  implementations  of  assignment  algorithms 
utilizing  as  many  features  as  possible  from  these  sequential  algorithms. 


3.  Development of the Parallel Assignment Algorithms


In  this  chapter,  a  parallel  weapon-target  assignment  algorithm  is  developed 
for  implementation  on  the  Intel  iPSC  computer  using  the  techniques  presented  in 
Chapter  2.  First,  a  formal  mathematical  definition  of  the  assignment  problem  is 
given.  A  background  on  research  conducted  during  the  past  three  decades  on  dif¬ 
ferent  solutions  to  the  assignment  problem  is  then  presented.  The  general  classes 
of  assignment  algorithms  that  have  emerged  from  this  research  are  described,  fol¬ 
lowed  by  a  detailed  analysis  of  several  candidate  sequential  assignment  algorithms 
with  the  goal  of  selecting  one  of  these  algorithms  for  parallelization.  Next,  different 
techniques  for  performing  parallel  search  of  a  problem  solution  space  are  explored. 
In  the  final  section  of  this  chapter,  a  parallel  search  technique  and  one  of  the  se¬ 
quential  algorithms  are  selected  for  use  in  the  parallel  assignment  algorithms.  This 
chapter  concludes  with  a  summary  of  the  parallel  algorithms  developed  and  their 
implications for the remainder of this research.

3.1  The Assignment Problem

In  Chapter  1,  optimum  assignment  was  characterized  as  a  problem  whose  solu¬ 
tion  time-space  complexity  increases  factorially  with  a  linear  increase  in  the  number 
of resources and requesters [Chu57]. There are several variations in the details of
how the assignment problem is stated. In some instances, it is considered a special
case  of  the  transportation  problem,  where  there  are  several  resources  at  each  source 
of  supply  and  multiple  requests  for  those  resources  at  each  sink.  The  assignment 
problem  addresses  a  special  case  of  the  transportation  problem  where  there  is  only 
one instance of a resource at each source and only one instance of that resource is
required  by  each  requester.  The  transportation  problem  itself  is  a  special  case  of  a 
general, single-objective, linear programming problem [Ign82].


3.1.1  History  Research on finding faster and more efficient solutions to the
assignment  problem  has  a  long  history,  beginning  with  graph  theoretical  work  pre¬ 
sented  by  Hungarian  mathematicians  Konig  and  Egervary  in  1931.  More  recent 
developments  were  accomplished  by  Dantzig,  Flood,  Von  Neumann,  and  Kuhn  in 
the 1950's [Chu57]. The programming methods and algorithms developed in the
1950’s  form  the  basis  for  much  of  the  work  that  has  been  done  on  assignment  prob¬ 
lem  solutions  up  to  the  present  [MaN86].  Various  modifications  to  these  original 
assignment  algorithms  have  been  made  in  an  effort  to  enhance  their  execution  speed 
and efficiency on modern digital computers [McG83, CaT80, BaG77, Hun83, Ber81,
GlK74, Hat75, SrT73, MaN86].

3.1.2  Statement  of  the  Assignment  Problem  The  assignment  problem  can  be 
stated  in  words  as:  Given  a  number  of  resources  and  a  number  of  requesters  of  those 
resources,  and  given  the  profit  or  usefulness  of  each  resource  to  each  requester  in 
the form of a rating matrix where element $a_{ij}$ is the profit of assigning resource i to
requester  j,  the  problem  is  to  assign  each  resource  to  one  and  only  one  requester  in 
a  way  that  a  given  measure  of  effectiveness  is  optimized  [Chu57].  Mathematically, 
the  assignment  problem  can  be  stated  as  follows: 


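Given an $n \times n$ rating matrix

$$A = \|a_{ij}\|, \quad a_{ij} > 0 \ \text{for}\ i,j = 1,2,\ldots,n \quad (n > 3)$$

find an $n \times n$ assignment matrix $X = \|x_{ij}\|$ such that

$$x_{ij} = \begin{cases} 1 & \text{if resource } i \text{ is assigned to requester } j \\ 0 & \text{otherwise} \end{cases} \tag{3-1}$$

$$\sum_{j=1}^{n} x_{ij} = 1 \quad \text{for } i = 1,2,\ldots,n \tag{3-2}$$

$$\sum_{i=1}^{n} x_{ij} = 1 \quad \text{for } j = 1,2,\ldots,n \tag{3-3}$$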

The conditions of Equations 3-2 and 3-3 specify that each row and column of matrix
X will contain one element with a value of 1 and all other elements will be zero
[Chu57]. The requirement of square matrices at first appears to limit the problem
to cases where the number of resources equals the number of requesters. But situations
where they are not equal can also be solved by adding “dummy” resources and
requesters  to  make  matrix  A  square.  The  associated  rating  or  cost  of  these  added 
matrix  elements  should  be  set  to  zero  so  that  they  will  not  be  included  in  the  final 
assignment  solution.  Other  more  efficient  methods  of  handling  this  unequal  situation 
have  also  been  devised  [BoL71a]. 
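
The problem statement and the dummy-padding idea can be illustrated with a brute-force sketch. It assumes the measure of effectiveness is the total rating of the selected elements and that this total is to be maximized. The function names are hypothetical. Because it tries every one of the n! permutations, the sketch is useful only for very small n, which is the factorial growth noted in Section 3.1.

```python
from itertools import permutations


def pad_to_square(ratings: list[list[int]]) -> list[list[int]]:
    """Add zero-rated dummy rows or columns until the matrix is square."""
    n = max(len(ratings), max(len(row) for row in ratings))
    padded = [list(row) + [0] * (n - len(row)) for row in ratings]
    padded += [[0] * n for _ in range(n - len(ratings))]
    return padded


def best_assignment(ratings: list[list[int]]) -> tuple[int, tuple[int, ...]]:
    """Return (total rating, assignment) found by exhaustive search.

    assignment[i] is the requester given to resource i, so the result
    corresponds to x_ij = 1 exactly when j == assignment[i].
    """
    a = pad_to_square(ratings)
    n = len(a)

    def total(perm: tuple[int, ...]) -> int:
        return sum(a[i][perm[i]] for i in range(n))

    # A permutation places exactly one 1 in each row and each column of X.
    best = max(permutations(range(n)), key=total)
    return total(best), best


if __name__ == "__main__":
    print(best_assignment([[4, 1, 3], [2, 0, 5], [3, 2, 2]]))
    # Two resources, three requesters: a dummy resource row is added.
    print(best_assignment([[4, 1, 3], [2, 0, 5]]))
```

Representing the assignment as a permutation satisfies the row and column conditions by construction, so only the objective needs to be evaluated for each candidate.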

3.2  Sequential  Assignment  Algorithms 

As  stated  in  Section  3.1.1,  much  of  the  development  of  assignment  algorithms 
over  the  past  three  decades  has  been  based  on  the  research  accomplished  in  the 
1950's by Dantzig, Flood, Von Neumann, and Kuhn. Other methods developed in
the  study  of  network  flow  have  provided  additional  means  of  solving  the  assignment 
problem  [Smi82].  Two  basic  approaches,  a  simplex-based  transportation  method 
and  the  Hungarian  method,  have  emerged  as  the  most  popular  means  of  solving  the 
assignment  problem  primarily  because  of  their  simplicity  and  ease  of  implementation 
[Hat75, GlK74, MaN86]. Because of limited time and space, the many different
assignment algorithms cannot all be covered in detail. Instead, brief summaries of each
are  presented  in  this  section.  Then,  in  the  following  section,  the  transportation 
and  the  Hungarian  methods  for  solving  the  assignment  problem  are  analyzed.  A 
detailed  presentation  and  an  example  problem  of  both  methods  are  presented  in  the 
Appendices  to  illustrate  how  the  algorithms  operate. 

3.2.1  The  Simplex  Method  The  simplex  method,  developed  by  Dantzig,  is  a 
general approach that can be used to solve almost all single-objective, linear programming
problems. The basic approach of the simplex method is to start with a feasible
solution to a problem and improve upon this solution in a step-by-step fashion until
an optimum solution is reached [Kre68]. A feasible solution means that all the
constraints  placed  on  the  optimization  (minimization  or  maximization)  in  the  origi¬ 
nal  problem  statement  are  satisfied.  In  terms  of  the  assignment  problem,  a  feasible 
solution  would  be  one  where  each  resource  is  assigned  to  a  different  requester. 

The  type  of  problem  most  easily  solved  by  the  simplex  method  is  one  where 
there  is  a  single  objective  function  that  is  to  be  maximized  or  minimized,  subject  to 
constraints which are stated in the form of a system of linear equations. Additional
variables,  called  slack  variables,  are  added  to  this  system  of  equations  to  aid  in 
converging  on  the  optimal  solution.  During  the  course  of  the  solution,  there  are  two 
sets  of  variables.  One  set  is  called  basic  and  consists  of  variables  that  have  been 
incorporated  into  the  present  version  of  the  solution.  The  other  set  of  variables  is 
called  non-basic  and  is  comprised  of  variables  not  incorporated  into  the  solution. 
Variables  are  modified  and  exchanged  between  the  basic  and  non-basic  sets  one  at  a 
time until conditions indicate that an optimal solution has been reached. One of the
primary disadvantages of the general simplex method is that the solution it provides
is not necessarily integer-valued. Modified versions of the general simplex method have been
developed  to  provide  integer  solutions,  but  they  are  somewhat  less  efficient  [Ign82]. 

Because  the  simplex  method  is  a  general  approach,  specialized  versions  of  it 
have been developed to solve specific problems. Different rules are adopted for
selecting variables to enter the basic set, and these vary according to the type of problem
being solved. One example of a specialized version is the transportation method,
which is described next.

3.2.2  The  Transportation  Method  The  transportation  problem  originated  from 
studies  made  to  improve  the  efficiency  of  utilizing  available  transport  capacity  in  the 
railway and trucking industries. An example of this type of problem is minimization of
the  cost  of  moving  empty  freight  cars  from  their  present  locations  to  other  locations 
where they can be used to transport goods [Chu57]. The transportation problem
existed prior to the development of the simplex method. However, efficient solutions
had not been developed for it until the techniques of the simplex method were
applied  [Ign82].  As  stated  earlier,  the  assignment  problem  is  a  special  case  of  the 
transportation  problem.  The  transportation  method  uses  a  cost  or  rating  matrix 
to  represent  the  problem  similar  to  the  one  described  in  the  assignment  problem 
statement. An additional row and column are added to the rating matrix to represent
the  number  of  resources  available  at  each  source  and  the  number  of  requests  for 
those  resources  at  each  sink.  In  order  to  use  the  transportation  method  to  solve  the 
assignment  problem,  all  values  in  this  additional  column  and  row  must  be  set  to  one. 

The  basic  steps  of  the  transportation  method  are  similar  to  the  general  simplex, 
although  they  are  somewhat  obscured  by  the  matrix  representation  of  the  problem. 
Many of the computations that would normally be required by the simplex method
are avoided by exploiting this matrix representation and using a somewhat simpler
approach [Kre68, Ign82]. There are two phases to the transportation technique. The
first phase generates a basic feasible solution to satisfy all the problem constraints
(i.e., make initial assignments of all resources to all requesters). The second phase
consists of determining whether or not the initial solution can be improved. If not,
the algorithm terminates. Otherwise, the current assignment is reshuffled to improve
the  value  of  the  objective  function.  This  reshuffling  is  analogous  to  the  exchange  of 
basic  and  non-basic  variables  in  the  simplex  technique  [Ign82].  When  the  solution 
obtained  in  this  manner  cannot  be  improved  upon,  or  if  it  is  found  to  be  unbounded, 
then  the  algorithm  terminates.  The  exact  steps  of  the  transportation  algorithm  are 
presented  in  Appendix  A.  The  next  section  describes  another  modification  to  the 
simplex  method. 

3.2.3  The Alternating Basis Algorithm  The Alternating Basis (AB) algorithm
is a modification to the simplex method which avoids the unnecessary inspection
of alternative feasible solutions [BaG77]. It was presented in 1977 by Barr,
Glover, and Klingman in an effort to reduce the storage requirements and computational
inefficiencies of using the simplex method to solve the assignment problem.
Their approach uses a rooted tree graphical representation of the problem where each
node in the tree corresponds to either a source or a destination. The nodes are connected
by arcs which are assigned a value of 1 if the two nodes are to be “assigned”
to each other and 0 otherwise. The “alternating” part of the algorithm's name stems
from the alternating manner in which the 0-arcs and 1-arcs are distributed in the
tree structure. By restricting the tree structure to the “alternating path” as it is
referred  to  in  their  paper,  degenerate  solutions  that  would  normally  be  considered 
by  the  general  simplex  method  are  avoided  and  the  efficiency  of  the  computations 
is  increased.  The  feasible  solutions  or  bases  that  are  considered  are  incrementally 
improved  in  a  step-by-step  manner  exactly  as  in  the  general  simplex  method. 

Some  computational  comparisons  of  the  AB  algorithm  against  other  imple¬ 
mentations  of  simplex-based  algorithms  were  made  by  Barr  and  his  colleagues.  The 
results  showed  that  the  AB  algorithm  was  approximately  15%  faster  than  the  clos¬ 
est  competitor  [BaG77].  The  number  of  basic  and  non-basic  variable  exchanges  was 
reduced  by  as  much  as  25%  over  the  other  methods  in  the  comparison.  These  perfor¬ 
mance  figures  indicate  that  improved  performance  of  the  simplex  method  is  strongly 
dependent  on  the  rules  for  selecting  variables  to  enter  the  basic  solution  set. 

3.2.4  The  Hungarian  Method  Kuhn  presented  a  paper  in  1955  describing  a 
method  for  solving  the  assignment  problem  which  he  titled  “The  Hungarian  Method 
for the Assignment Problem” [Kuh55]. The overall scheme of the Hungarian method
is  based  on  a  theorem  proved  by  the  Hungarian  mathematicians  Konig  and  Egervary 
[Chu57].  Their  theorem  involves  covering,  or  including  in  sets,  the  elements  of  a  ma¬ 
trix  which  belong  to  one  of  two  distinct  classes.  In  the  Hungarian  method,  these 
two  classes  are  formed  by  simple  subtractions  from  members  of  the  rating  matrix 
which yield null elements and non-null elements. The minimum number of
coverings, referred to as lines by Kuhn, that include all these null elements is
equal to the maximum number of independent null elements, where independent
means that no two of the chosen null elements occur in the same row or column.
The Hungarian method provides minimum or maximum cost assignments if the
resulting null elements are independent.
This  restriction  is  analogous  to  permitting  the  assignment  of  each  resource  to  only 
one  requester  and  vice  versa.  When  the  number  of  covering  lines  equals  the  number 
of  resources  to  be  assigned,  then  there  exists  in  the  set  of  covered  null  elements  at 
least  one  optimal  assignment  of  all  the  resources.  The  method  works  on  the  prin¬ 
ciple  of  selectively  reducing  all  elements  in  a  row  or  column  by  the  same  amount 
and  locating  independent  positions  in  the  matrix  that  first  become  null.  These  in¬ 
dependent  null  positions  correspond  to  minimum  or  maximum  cost  assignments.  A 
detailed  presentation  of  the  Hungarian  method  and  an  illustrative  example  problem 
are  included  in  Appendix  B.  Several  modifications  of  the  Hungarian  method  have 
been  made  since  its  introduction.  Some  of  these  modifications  are  presented  in  the 
next  section. 

3.2.5  Modifications to the Hungarian Method  Since the Hungarian method is one of
the  more  popular  algorithms  for  solving  the  assignment  problem,  it  has  received  the 
most  attention  by  researchers  who  desired  to  improve  its  efficiency.  Carpaneto  and 
Toth  published  an  improved  version  of  the  Hungarian  method  in  1980  which  reduces 
the  amount  of  time  required  to  locate  the  zero  elements  and  the  unexplored  rows  of 
the  current  cost  matrix.  They  used  pointers  to  accomplish  this  improvement,  which 
also  reduced  the  storage  requirements  of  the  implementation.  Another  improvement 
they  made  over  the  original  Hungarian  method  was  to  modify  the  choice  of  the 
initial  assignment  solution.  Computational  experiments  showed  that  the  modified 
algorithm  outperformed  other  implementations  of  the  Hungarian  method  for  densely 
populated  rating  matrices. 

Bertsekas performed a more drastic modification to the Hungarian method. He
changed the way that the costs of assignments were incremented during the course of
the algorithm, which resulted in faster convergence on the optimal assignment. He
called his method outpricing, which effectively reduces the row operations required on
the rating matrix and results in solving a problem of a smaller dimension. The basic
concept behind his method involves cooperative bidding where requesters attempt
to outbid each other for the resources to be assigned [Ber81].

Bourgeois  and  Lasalle  modified  the  Munkres  version  of  the  Hungarian  method 
to include more efficient means for solving assignment problems that are not square
(i.e., the number of resources does not equal the number of requesters) [BoL71a,
BoL71b, Mun57]. They present a proof that uses two submatrices, one consisting of
the real and the other consisting of the dummy resources or requesters. They argue
that if the costs of the assignments in the dummy submatrix are set high enough,
then all of the real
resources  will  be  assigned  first.  They  conclude  that  the  addition  of  the  dummy 
elements  is  not  necessary.  They  present  an  algorithm  and  computational  comparisons 
with  the  original  Hungarian  method  to  show  that  their  method  is  a  performance 
improvement,  especially  when  dealing  with  rectangular  cost  matrices. 

3.2.6  The  Branch  and  Bound  Algorithm  The  Branch  and  Bound  algorithm 
for the assignment problem was presented by Land and Doig in 1960. It is a technique
where a small portion of the many possible combinations of assignments is selected
and  an  objective  function  is  evaluated  subject  to  certain  bounds  or  restrictions. 
The  basic  approach  is  to  obtain  an  optimal  value  of  the  objective  function  that  lies 
between  upper  and  lower  bounds.  The  objective  function’s  value  cannot  be  less  than 
the  lower  bound.  The  upper  bound  is  normally  the  value  of  the  best  feasible  solution 
obtained  thus  far  in  the  current  algorithm  iteration.  The  algorithm  terminates  when 
it  can  be  determined  that  there  is  no  lower  bound  less  than  the  current  upper  bound 
[Ign82]. The branching part of the algorithm partitions the solution space into
smaller,  mutually  exclusive  subsets.  The  lower  bounds  associated  with  each  subset 
are  calculated  and  compared  with  the  current  value  of  the  upper  bound.  If  the 
lower  bounds  are  not  less  than  the  current  upper  bound,  then  the  subsets  are  not 
partitioned  any  further  since  no  better  solution  could  be  obtained  by  branching. 
This branching process is repeated until all possible subsets have been formed or
until none of the lower bounds are less than the current best feasible solution.
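
The bounding and pruning rule described above can be sketched in a few lines. This is a minimal illustration, not Land and Doig's formulation: it assumes a square cost matrix, branches by fixing one row (resource) at a time, and uses the sum of the cheapest still-available column in each unassigned row as the lower bound. The function name and the bound are choices made here for illustration.

```python
import math

def branch_and_bound_assignment(cost):
    """Minimise the total cost of assigning each row to a distinct column."""
    n = len(cost)
    best = {"value": math.inf, "assignment": None}  # current upper bound

    def lower_bound(row, used, partial):
        # Cost so far plus the cheapest unused column for every unassigned row.
        bound = partial
        for r in range(row, n):
            bound += min(cost[r][c] for c in range(n) if c not in used)
        return bound

    def branch(row, used, partial, chosen):
        if row == n:
            if partial < best["value"]:
                best["value"] = partial            # better feasible solution
                best["assignment"] = list(chosen)
            return
        # Prune: this subset cannot beat the best feasible solution so far.
        if lower_bound(row, used, partial) >= best["value"]:
            return
        for c in range(n):
            if c not in used:
                used.add(c)
                chosen.append(c)
                branch(row + 1, used, partial + cost[row][c], chosen)
                chosen.pop()
                used.remove(c)

    branch(0, set(), 0, [])
    return best["value"], best["assignment"]

example = [[9, 2, 7, 8],
           [6, 4, 3, 7],
           [5, 8, 1, 8],
           [7, 6, 9, 4]]
value, assignment = branch_and_bound_assignment(example)
print(value, assignment)
```

The returned list gives, for each row index, the column assigned to it. A tighter lower bound prunes more subsets at the price of more work per node.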

3.2.7  The Out-of-Kilter Algorithm  The Out-of-Kilter algorithm resulted from
the study of optimal network flow and was presented by Ford and Fulkerson in 1961
[Dan63]. Its original application was to find either minimal or maximal flow through
a network. However, applications have been found in other areas, including the
solution of transportation and assignment problems [Smi82]. The network is represented
as a directed graph where the nodes correspond to locations and the arcs represent
links between different locations. Associated with each arc is a cost per unit flow
$c_{ij}$ through the arc which connects node i and node j. The actual flow through the
arc is $x_{ij}$. The out-of-kilter algorithm is based on the conservation of flow at all
nodes of the network: what flows into a node must flow out. The conservation of
flow at node i is represented by a multiplier $\pi_i$. The $\pi$ multipliers of two nodes are
combined with the $c_{ij}$ of the arc connecting these nodes to produce the flow value
$x_{ij}$. Upper and lower bounds on the flow $x_{ij}$ can also be imposed, but are not needed
when the algorithm is used for solving the assignment problem [Smi82]. The points
$(x_{ij}, c_{ij} + \pi_i - \pi_j)$ are plotted to determine if they fall on a “kilter line” which is
graphically derived from the upper and lower flow bounds, and the conservation of
flow multipliers for each node.

The  algorithm  first  assigns  initial  flows  to  each  arc  and  then  searches  for  an  arc 
that  lies  off  the  kilter  line,  which  is  termed  being  “out-of-kilter.”  An  arc  is  brought 
into  kilter  by  adjusting  all  flows  in  the  network  from  the  source  to  the  destination 
linked  by  this  selected  arc.  This  process  is  repeated  until  all  the  arcs  are  brought  into 
kilter.  The  assignment  problem  can  be  solved  using  the  out-of-kilter  algorithm  by 
representing  the  rating  matrix  as  a  graph  where  the  nodes  correspond  to  the  resources 
and requesters, and the values of the arcs correspond to the cost of assigning resource
i  to  requester  j.  The  upper  bound  on  flow  through  each  arc  must  be  set  to  infinity 
(or  a  “large”  number)  and  the  lower  bound  to  zero.  Then  an  additional  node  must  be 
added that is linked to each resource and requester node. The arcs associated with
[...] the out-of-kilter algorithm is used to find the minimal-cost feasible flow through this
network.  The  optimal  assignment  is  reached  when  the  number  of  resources  flowing 
through  the  new  node  is  equal  to  the  number  of  resources  to  be  assigned  [Smi82]. 

3.3  Evaluation  of  Candidate  Algorithms 

In  this  section,  two  sequential  algorithms  for  solving  the  assignment  problem 
are  analyzed.  First,  the  simplex-based  transportation  method  is  evaluated.  The  de¬ 
tailed  steps  of  the  transportation  method  and  a  small  example  problem  are  worked 
out  and  included  in  Appendix  A  to  illustrate  the  algorithm.  The  Hungarian  method 
for  the  assignment  problem  is  analyzed  next.  A  detailed  explanation  of  the  Hun¬ 
garian  method  and  an  example  problem  using  the  same  cost  matrix  data  as  the 
transportation  method  example  are  also  included  in  Appendix  B  so  that  some  com¬ 
parisons  of  the  two  algorithms  can  be  made. 

3.3.1  The  Transportation  Method  Many  variations  of  Dantzig’s  simplex  method 
have  been  devised  in  order  to  solve  specific  problems.  One  modification  to  the  sim¬ 
plex  method  was  made  by  Dantzig  himself  and  it  was  done  to  allow  a  simpler  solution 
to  the  transportation  and  assignment  problems  [Chu57,  Dan63,  Ign82].  The  basic 
approach  of  the  simplex  method  was  described  in  Section  3.2.1  and  will  not  be  re¬ 
peated  here.  A  brief  overview  of  the  transportation  method  was  given  in  Section 
3.2.2  and  an  expansion  of  that  overview  is  presented  here. 

The  transportation  method  utilizes  a  table  representation  similar  to  the  cost 
matrix described in the assignment problem statement of Section 3.1 where the $c_{ij}$
elements represent the cost of assigning resource i to requester j. Some modifi¬
cations  are  needed  which  involve  adding  another  row  and  column,  and  providing 
additional  space  for  maintaining  some  intermediate  calculations.  Also,  elements 
of the assignment matrix are incorporated into this tabular representation in order
to facilitate improvement on non-optimal assignments during the course of the
algorithm. An example of the table representation is shown in Table 3-1.

Table 3-1.  Example Transportation Table

                          requester
    resource      1       2       3       4      a_i
    -------------------------------------------------
       1         x11     x12     x13     x14      1
                 c11     c12     c13     c14
       2         x21     x22     x23     x24      1
                 c21     c22     c23     c24
       3         x31     x32     x33     x34      1
                 c31     c32     c33     c34
       4         x41     x42     x43     x44      1
                 c41     c42     c43     c44
    -------------------------------------------------
      b_j         1       1       1       1       4

In cases where the number of resources does not equal the number of requesters,
"dummy" resource rows or requester columns with zero cost elements must be added
to the above tabular representation. The $a_i$ and $b_j$ entries for these additional
rows or columns must be sufficient to balance the number of resources and
requesters [Ign82].
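
The balancing step can be illustrated with a short sketch. The helper below is hypothetical (it is not taken from Appendix A); it assumes the cost matrix is a list of equal-length rows and pads it with zero-cost dummy rows or columns until it is square.

```python
def pad_to_square(cost, dummy_cost=0):
    """Append dummy rows or columns so that resources and requesters balance."""
    rows = len(cost)
    cols = len(cost[0]) if rows else 0
    size = max(rows, cols)
    # Extend every real row with dummy requester columns if needed.
    padded = [list(row) + [dummy_cost] * (size - cols) for row in cost]
    # Append dummy resource rows if needed.
    padded += [[dummy_cost] * size for _ in range(size - rows)]
    return padded

# Two resources, three requesters: one dummy resource row is added.
print(pad_to_square([[4, 1, 3],
                     [2, 0, 5]]))
```

A requester matched to a dummy row in the final solution is simply left unassigned.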

There  are  two  phases  to  the  transportation  method.  The  first  phase  is  to 
formulate  the  initial  basic  feasible  solution.  The  second  phase  checks  the  initial 
solution  for  optimality  and  incrementally  improves  upon  it  until  it  is  optimal.  The 
most difficult portion of the transportation algorithm is the search for the 0-paths,
explained in Appendix A, that are required for the assignment of the $\epsilon$ allocations
and  for  the  exchange  of  basic  and  non-basic  variables.  For  large  problem  sizes, 
these  operations  would  tend  to  dominate  the  computation  time.  Another  potential 
bottleneck  whose  details  are  explained  in  Appendix  A  is  the  satisfaction  of  the 
relationship where the $R_i$ and $K_j$ values are determined for assigned cells and the
values of the $\Delta_{ij}$'s are computed for the unassigned cells. There do not seem to be
any obvious shortcuts to reduce the requirements of these 0-path and $\Delta_{ij}$ operations.

In  order  to  allow  a  comparison  with  the  Hungarian  method,  the  computational 
time  complexity  of  the  transportation  method  needs  to  be  estimated.  There  are  many 
assumptions  which  can  be  made  that  will  affect  the  complexity  estimate.  In  order  to 
simplify  the  estimate  made  here,  the  operations  that  will  be  considered  are  scanning 
a  row  or  column,  adding  or  subtracting  from  a  row  or  column,  covering  or  marking 
a  line,  and  searching  for  and  modifying  an  element  of  the  matrix.  The  operations 
carried  out  on  the  entire  matrix  will  be  the  most  costly,  while  simple  operations  on 
single  variables  are  the  least  costly.  The  worst  case  scenario  is  assumed  to  be  the 
situation  where  each  iteration  of  the  algorithm  adds  one  additional  member  to  the 
final  solution.  The  following  discussion  is  based  on  the  solution  of  an  n  x  n  matrix. 

Referring  to  Appendix  A,  Step  1-1  will  be  considered  the  overhead  step  re¬ 
quired  for  both  algorithms  and  not  considered  here.  Steps  1-2  and  1-3  will  require  n 
operations to locate and modify the appropriate elements. In Steps 1-4 and 1-5, all
initial unassigned cells are independent, so the object is to choose the n - 1 least cost
cells. This will require scanning the n rows n - 1 times and making n modifications
to the appropriate variables to mark the positions. The total number of operations
for these steps is $n + n(n - 1)$ or $2n + n^2$.

The next significant operations occur in Step 2-3 where the $\Delta_{ij}$ equation must
be solved for the n + n - 1 members of the solution set. Step 2-4 requires the
calculation and assignment of values to all elements of the matrix that are not part
of the solution set or $2(n^2 - (2n - 1))$ operations. The entire matrix must be scanned
in  Step  2-5,  requiring  n  row  scans.  Step  2-6  in  practice  could  be  combined  with 
Step  2-5,  so  another  matrix  scan  will  not  be  included.  The  operations  required  to 
construct  the  0-path  are  more  difficult  to  estimate.  In  the  worst  case,  all  the  members 
of  the  current  solution  set  would  be  included  in  the  0-path.  The  scans  of  rows  and 
columns to determine the path direction and the determination of the 0 assignment
changes will require $2(2n - 1) + (2n - 1)$ operations. For Step 2-9, the worst case
would be to need n - 1 $\epsilon$ assignments. These $\epsilon$ assignments would each require at
most n row scans to find the minimum element, and three row scans, three column
scans, and one assignment for each of the n - 1 cells found. Steps 2-2 through 2-9
would need to be repeated n - 1 times for the worst case scenario assumed. The
total estimated operations required are:


$$\text{operations} = n + (2n + n^2) + (n - 1)(3n^2 + 12n - 12) \tag{3-5}$$

Equation 3-5 simplifies to the following expression:

$$\text{operations} = 3n^3 + 10n^2 - 21n + 12 \tag{3-6}$$

As a result, the transportation algorithm is $O(n^3)$, with a leading term of $3n^3$.

3.3.2  The  Hungarian  Method  Kuhn  presents  a  rigorous  mathematical  proof 
of  the  theory  behind  his  Hungarian  method,  which  is  primarily  based  on  one  main 
theorem  and  an  important  property  of  matrices  related  to  set  theory.  This  theorem, 
proved  by  Konig  and  generalized  by  Egervary  is: 

If the elements of a matrix are divided into two classes by a property
R, then the minimum number of lines that contain all the elements with
property R is equal to the maximum number of elements with the property
R, with no two on the same line [Chu57].

The  reference  to  a  line  means  a  row  or  column  of  a  matrix.  The  restriction  of 
no  two  elements  on  the  same  line  will  be  used  in  the  Hungarian  method  as  a  means 
to  make  the  optimal  assignment. 
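
The theorem can be checked numerically on small matrices. The sketch below is a brute-force illustration only: the matrix `marked` flags the elements that have property R, and both function names are hypothetical. It is exponential in the matrix size and is not part of the Hungarian method itself.

```python
from itertools import combinations

def min_line_cover(marked):
    """Smallest number of rows and columns that contain every marked cell."""
    n_rows, n_cols = len(marked), len(marked[0])
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols) if marked[r][c]]
    lines = [("row", r) for r in range(n_rows)] + [("col", c) for c in range(n_cols)]
    for k in range(len(lines) + 1):
        for chosen in combinations(lines, k):
            chosen = set(chosen)
            if all(("row", r) in chosen or ("col", c) in chosen for r, c in cells):
                return k
    return len(lines)

def max_independent(marked):
    """Largest set of marked cells with no two on the same row or column."""
    n_rows, n_cols = len(marked), len(marked[0])

    def search(row, used_cols):
        if row == n_rows:
            return 0
        best = search(row + 1, used_cols)          # leave this row unused
        for c in range(n_cols):
            if marked[row][c] and c not in used_cols:
                best = max(best, 1 + search(row + 1, used_cols | {c}))
        return best

    return search(0, frozenset())

marked = [[1, 1, 0],
          [1, 0, 0],
          [1, 0, 0]]
print(min_line_cover(marked), max_independent(marked))
```

For this matrix, the first column and the first row cover every marked cell, and at most two marked cells can be chosen with no two on the same line, so both functions return the same count, as the theorem requires.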

The  important  property  of  matrices  presented  by  Von  Neumann  is: 

Given a cost matrix $A = \|a_{ij}\|$, if another matrix $B = \|b_{ij}\|$ is formed
where $b_{ij} = a_{ij} - u_i - v_j$ and where $u_i$ and $v_j$ are arbitrary constants, the
solution of A is identical to that of B [Chu57].

This property says that if all elements in a row or column are increased or decreased
by the same amount, then an equivalent assignment can still be made using the
modified elements. The important roles of the theorem and property are made
clearer in the presentation of the algorithm in Appendix B.
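
A small numerical check makes the property concrete. The sketch below is illustrative only: it finds the optimal assignment by brute force over all permutations, and the example matrix and the constants `u` and `v` are arbitrary values chosen here. The optimal assignment is the same before and after the shift, and the total cost changes by exactly the sum of the constants.

```python
from itertools import permutations

def assignment_cost(cost, perm):
    """Total cost when row i is assigned to column perm[i]."""
    return sum(cost[i][perm[i]] for i in range(len(cost)))

def best_assignment(cost):
    """Brute-force optimal assignment of a small square cost matrix."""
    return min(permutations(range(len(cost))),
               key=lambda perm: assignment_cost(cost, perm))

def shift(cost, u, v):
    """Form b[i][j] = a[i][j] - u[i] - v[j]."""
    return [[cost[i][j] - u[i] - v[j] for j in range(len(v))]
            for i in range(len(u))]

a = [[9, 2, 7, 8],
     [6, 4, 3, 7],
     [5, 8, 1, 8],
     [7, 6, 9, 4]]
u = [2, 3, 1, 4]     # arbitrary row constants
v = [3, 0, 0, 0]     # arbitrary column constants
b = shift(a, u, v)

perm_a, perm_b = best_assignment(a), best_assignment(b)
print(perm_a == perm_b)
print(assignment_cost(a, perm_a) - assignment_cost(b, perm_b) == sum(u) + sum(v))
```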

The  general  approach  of  the  Hungarian  method  involves  searching  the  rating 
matrix  for  the  minimum  values  in  each  row  or  column.  These  minimum  row  or 
column  values  are  subtracted  from  each  element  of  the  rating  matrix  to  form  a  new 
matrix  that  will  contain  a  certain  number  of  zero  elements.  These  zero  elements 
form  one  class  and  the  non-zero  elements  form  the  other  of  the  two  classes  required 
by Konig in his theorem. If the minimum number of lines that cover all the null (i.e.,
zero) elements is equal to the number of resources to be assigned, then the optimal
assignment  is  contained  in  this  set  of  null  elements.  The  method  of  obtaining  an 
optimal  assignment  from  these  null  elements  is  illustrated  by  an  example  problem  in 
Appendix  B. 
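
The initial reduction can be sketched as follows. This is a minimal illustration of the row and column subtractions only, assuming a square matrix of numeric costs; the function name is hypothetical, and the later covering and adjustment steps of Appendix B are not shown.

```python
def reduce_matrix(cost):
    """Subtract each row's minimum, then each column's minimum."""
    rows = [[x - min(row) for x in row] for row in cost]
    col_min = [min(col) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, col_min)] for row in rows]

rating = [[9, 2, 7, 8],
          [6, 4, 3, 7],
          [5, 8, 1, 8],
          [7, 6, 9, 4]]
reduced = reduce_matrix(rating)
for row in reduced:
    print(row)

# Positions of the null (zero) elements produced by the reduction.
zeros = [(i, j) for i, row in enumerate(reduced) for j, x in enumerate(row) if x == 0]
print(zeros)
```

Every row and every column of the reduced matrix contains at least one zero, which is what allows the covering-line test described above to be applied.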

Now,  a  few  comments  on  the  computational  aspects  of  this  algorithm.  Re¬ 
ferring  to  Appendix  B,  the  Hungarian  method  requires  extensive  scanning  of  the 
rows  and  columns  of  the  rating  matrix,  which  can  be  time  consuming  for  large  prob¬ 
lems.  Some  researchers  have  developed  methods  to  reduce  the  amount  of  scanning  in 
their implementations of the Hungarian method [CaT80, McG83]. This scanning is
the  major  drawback  to  the  Hungarian  method.  Otherwise,  the  operations  required 
to  implement  the  algorithm  are  straightforward.  Typical  operations  are  additions, 
subtractions,  and  comparisons. 

As  in  the  transportation  algorithm  presentation,  the  computational  complexity 
of  the  Hungarian  method  also  needs  to  be  estimated  so  that  the  more  efficient  algo¬ 
rithm  can  be  chosen  for  the  parallel  implementations.  The  same  operations  will  be 
considered in this case as in the previous analysis for an n x n cost matrix. Beginning
with Step 1 of the Hungarian method, the location of the minimum element in each
row requires n row scans and the subtraction of the minimum element from each row
element will require an additional $n^2$ operations. These operations are repeated with
the columns, so the worst case number of operations for this step is $2(n + n^2)$. For
Step  2,  locating  the  row  with  one  null  element  may  require  n  row  scans  and  n  —  1 
operations  to  cross  out  other  possible  column  null  elements.  In  the  worst  case,  only 
one  such  element  would  be  found  per  iteration  of  the  algorithm.  Step  3  requires  the 
same number of operations as Step 2. However, either Step 2 or Step 3 would be
performed, but not both. Steps 4.1 through 4.3 depend on the number of rows and
columns already “marked” which corresponds to the number of assignments made
in the current state of the solution. The operations required would be n row scans
and n - m row markings where m is the number of assignments yet to be made.
Step  4.2  will  require  scanning  n  columns  and  marking  at  most  m  columns.  Step  4.3 
requires  another  n  row  scans  and  possibly  marking  m  rows.  The  total  operations  for 
Steps  4.1  through  4.3  are  n  +  (n-m)  +  n  +  m  +  n  +  m  =  4n  +  m. 

The  next  significant  operations  occur  in  Step  4.5  where  at  most  n  rows  or 
columns  will  need  to  be  marked.  Step  5  will  vary  in  the  number  of  operations  in 
each  iteration,  but  the  entire  matrix  will  need  to  be  scanned  and  each  element  will 
be  either  subtracted  from,  added  to,  or  left  the  same  depending  on  the  location  of 
the  marked  rows  and  columns.  The  matrix  scan  will  require  n  row  searches  and  the 
operations on each element will need at most $n^2$ steps. Step 5 operations total $n + n^2$.
With  the  exception  of  Step  1,  the  Hungarian  method  will  require  n  -  1  iterations 
to  solve  an  n  x  n  assignment  problem  if  only  one  assignment  is  made  during  each 
iteration.  There  are  other  situations  that  may  require  more  steps,  but  they  are  not 
easily estimated. The estimated total number of steps for the Hungarian method is
as follows:

$$\text{operations} = 2n + n^2 + \sum_{m=1}^{n-1} \left( n(n - 1) + 4n + m + n + n^2 \right) \tag{3-7}$$

Simplifying the expression yields the following estimate for the number of
operations required for the Hungarian method.

$$\text{operations} = 2n^3 + 3.5n^2 - 2.5n \tag{3-8}$$

The Hungarian method is also an $O(n^3)$ algorithm, but the coefficients of
the expression are smaller than those of the transportation method. The implications
of these analyses will be discussed in Section 3.5.
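
The two estimates can be compared numerically. The sketch below evaluates Equation 3-5 and Equation 3-7 directly, taking the summand of Equation 3-7 to be n(n - 1) + 4n + m + n + n^2; the function names are chosen here for illustration. The counts are the worst-case estimates of this section, not measured run times.

```python
def transportation_ops(n):
    """Worst-case operation estimate of Equation 3-5."""
    return n + (2 * n + n**2) + (n - 1) * (3 * n**2 + 12 * n - 12)

def hungarian_ops(n):
    """Worst-case operation estimate of Equation 3-7 (sum over m = 1 .. n-1)."""
    return 2 * n + n**2 + sum(
        n * (n - 1) + 4 * n + m + n + n**2 for m in range(1, n)
    )

for n in (4, 16, 64, 256):
    t, h = transportation_ops(n), hungarian_ops(n)
    print(f"n={n:4d}  transportation={t:10d}  hungarian={h:10d}  ratio={t / h:.3f}")
```

As n grows, the ratio of the two estimates approaches 3/2, the ratio of the leading coefficients.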

3.4  Parallel  Combination  Strategies 

There  are  many  ways  to  combine  the  solutions  to  several  subproblems  into  an 
overall problem solution [Qui87]. This section examines three techniques that have
been developed to perform this important task. The names of these techniques are
more commonly recognized as those of sequential algorithms, but they have been
recently developed into high-level, parallel strategies for solving problems involving
combinatorial search [HoZ83, WaL85, Qui87]. The assignment problem belongs to
this class of combinatorial search problems, defined by Wah as the process of finding
“one or more optimal or suboptimal solutions in a defined problem space” [WaL85].
The  objective  of  this  section  is  to  describe  and  evaluate  these  parallel  combination 
techniques.  Selection  of  the  high-level,  parallel  combination  strategy  to  be  used  in 
combination  with  the  selected  node  process  algorithm  is  made  in  Section  3.5. 

3.4.1  Branch and Bound  The basic approach of the branch-and-bound technique
is the systematic search of an OR-tree representation of the problem solution
space.  The  branch-and-bound  technique  begins  with  an  initial  problem  and  some 
objective  function  which  must  be  either  minimized  or  maximized.  It  first  attempts 
to  solve  the  problem  directly.  If  this  is  not  possible  because  the  problem  is  too  large 
to  be  solved  in  a  reasonably  short  time,  then  the  problem  is  divided  into  smaller 
subproblems. With each subproblem, constraints in the form of upper and lower
bounds are included. This process continues until either all the subproblems have
been completely decomposed and solved or the problem is shown to be unbounded
[Qui87]. In a parallel branch-and-bound technique, many of the functions involving
decomposing, solving subproblems, and evaluating constraints can be done in
parallel [WaL85]. In a multiprocessor, the decomposed subproblems are each assigned to
individual  processors  for  parallel  solution.  A  majority  of  the  individual  processors 
run  identical  node  process  programs.  In  most  cases,  a  centralized  controller,  hosted 
on  one  or  more  processors,  is  used  to  expand  the  problem  nodes  to  be  examined  and 
determine  the  conditions  for  terminating  the  overall  process. 

Lai  and  Sahni  examined  some  anomalies  in  parallel  branch-and-bound  algo¬ 
rithms.  They  observed  that  theoretically,  faster  speedup  is  possible  with  a  smaller 
number  of  processors.  Experimental  results  with  a  parallel  implementation  to  solve 
the  0-1  knapsack  problem  confirmed  the  theories  they  presented,  although  they  com¬ 
mented  that  in  practice  the  anomalies  would  rarely  show  up,  except  for  small  problem 
sizes. Mraz also encountered the same type of anomaly in his parallel branch-and-bound
implementation of the N-queens problem solution on the Intel hypercube. In his
results for the 8-queens problem, 16 processors solved the problem in 1.1 seconds while
32 processors required 2.2 seconds [Mra86]. This indicates that there are significant
overheads involved with implementing branch-and-bound techniques in a parallel
environment.


3.4.2  Alpha-Beta Search  The alpha-beta method involves the search of an
AND/OR  tree  representation  of  the  problem  solution  space.  Search  of  an  AND/OR 
tree  is  more  complicated  because  it  combines  the  techniques  of  branch-and-bound 
just  described  and  divide-and-conquer  which  will  be  presented  in  the  next  section. 
Alpha-beta  is  a  method  usually  employed  in  the  solution  of  two-person  zero-sum 
games like chess and checkers [Qui87].

The basic approach of the alpha-beta search is to consider the present state
of the problem solution, evaluate a number of possible alternative decisions, and
then incorporate those alternative decisions that result in the most advantageous
solution to the present problem. Two parameters, $\alpha$ and $\beta$, define a search “window”
which  is  used  to  prune  subtrees  from  the  solution  tree  that  do  not  contribute  to 
optimal  solutions.  Parallel  alpha-beta  algorithms  typically  assign  different  windows 
to  each  processor  so  that  faster  and  deeper  searches  of  the  AND/OR  tree  can  be 
accomplished  [SeB82].  One  problem  with  parallelization  of  the  alpha-beta  search  is 
that  extensive  communications  must  be  used  between  processors  to  update  the  search 
window.  If  communications  are  reduced  or  eliminated,  the  result  is  other  overheads 
related  to  processors  needlessly  searching  through  nodes  of  the  tree  determined  not 
optimal  by  another  node.  There  is  a  tradeoff  between  reducing  communications  and 
processing  efficiency  in  the  parallel  implementation  of  the  alpha-beta  search  method. 
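
The pruning role of the search window can be seen in a minimal sequential sketch. It assumes a two-player game tree given as nested Python lists with numeric leaf values; the function name and the `visited` bookkeeping are illustrative, and no parallel window assignment is attempted.

```python
import math

def alpha_beta(node, maximizing=True, alpha=-math.inf, beta=math.inf, visited=None):
    """Minimax value of a tree of nested lists, pruned with an (alpha, beta) window."""
    if not isinstance(node, list):          # leaf: return its evaluation
        if visited is not None:
            visited.append(node)
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alpha_beta(child, False, alpha, beta, visited))
            alpha = max(alpha, value)
            if alpha >= beta:               # window closed: prune remaining children
                break
        return value
    value = math.inf
    for child in node:
        value = min(value, alpha_beta(child, True, alpha, beta, visited))
        beta = min(beta, value)
        if alpha >= beta:                   # window closed: prune remaining children
            break
    return value

tree = [[3, 5], [2, 9], [0, 1]]
seen = []
print(alpha_beta(tree, visited=seen), seen)
```

In this tree the leaves 9 and 1 are never evaluated, because the window established by the first subtree already rules out their parents. In a parallel setting, a processor that lacks the updated window would evaluate such leaves needlessly, which is the overhead described above.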

3.4.3  Divide-and-Conquer  Unlike branch-and-bound or alpha-beta strategies,
the divide-and-conquer strategy searches an AND tree representation of the
problem solution space [Qui87]. Every subproblem solution is actually a part of the
overall solution, which differs from the other search techniques where many
subproblem solutions are discarded. Divide-and-conquer, as its name implies, divides
a  problem  into  smaller  subproblems  that  can  be  solved  faster  and  easier  than  the 
larger,  overall  problem.  Once  all  of  these  subproblems  are  solved,  the  results  are 
combined  to  form  the  solution  to  the  original  problem  [HoZ83].  Parallel  divide-and- 
conquer  depends  on  the  node  processes  to  determine  the  feasibility  or  optimality  of 
the  subproblem  solution. 
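
Applied to the assignment problem, the idea can be sketched as follows. The partitioning rule used here (square blocks along the diagonal of the cost matrix) and the brute-force block solver are assumptions made for illustration, not the partitioning scheme developed later in this report; each call to the hypothetical `solve_block` stands in for one node process.

```python
from itertools import permutations

def solve_block(cost, rows, cols):
    """Optimal assignment of the given rows to the given columns (brute force)."""
    best = min(permutations(cols),
               key=lambda perm: sum(cost[r][c] for r, c in zip(rows, perm)))
    return list(zip(rows, best))

def divide_and_conquer(cost, block_size):
    """Solve independent diagonal blocks and combine their assignments."""
    n = len(cost)
    assignment = []
    for start in range(0, n, block_size):
        idx = list(range(start, min(start + block_size, n)))
        # Each block is independent and could be solved on its own processor.
        assignment.extend(solve_block(cost, idx, idx))
    total = sum(cost[r][c] for r, c in assignment)
    return total, assignment

cost = [[5, 6, 1, 9],
        [6, 5, 9, 1],
        [1, 9, 5, 6],
        [9, 1, 6, 5]]
print(divide_and_conquer(cost, 2))   # two independent 2 x 2 subproblems
print(divide_and_conquer(cost, 4))   # one subproblem: the whole matrix
```

With this matrix the two-block solution is feasible but costs more than the solution of the unpartitioned problem, because the cheapest assignments cross block boundaries. The combined result of independent subproblems is therefore generally sub-optimal.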

An  important  factor  in  the  performance  of  the  divide-and-conquer  search  is  the 
“granularity  of  parallelism”  which  is  simply  the  minimum  overall  problem  partition 
size [WaL85]. Problem partitioning was emphasized in Chapter 2 as an important
consideration in mapping problem solutions onto parallel computers. Another
consideration in the parallel implementation of divide-and-conquer is the processor
utilization. The three phases of parallel divide-and-conquer are start-up, computation,
and wind-down [WaL85]. In the start-up phase, the initial problem is partitioned
and  the  resulting  subproblems  are  sent  to  the  individual  processors.  During  the 
computation  phase,  the  processor  utilization  is  typically  very  good.  However,  dur¬ 
ing  the  wind-down  phase,  many  processors  remain  idle  while  the  transferring  and  the 
combining  of  subproblem  results  occur.  This  results  in  tradeoffs  between  problem 
partition  size  and  processor  utilization.  Larger  partitions  mean  longer  time  spent  in 
the  computation  and  better  processor  utilization.  But  larger  partitions  also  reduce 
the  amount  of  parallelism  and  limit  the  speedup  possible  over  sequential  algorithms. 

An advantage of divide-and-conquer over the other techniques is that
interprocessor communications can be minimal during the computational phase
without any performance degradation. This is, of course, dependent on the type of
problem being solved. In the wind-down phase, transferring of subproblem results to
be combined into the overall solution can be viewed as communications. However,
these communications do not interfere with the processes running in the individual
processors because, at this time, they have already terminated.

3.5  Results  of  the  Analyses 

In this section, the results of the preceding analyses are summarized. The
algorithm to be used as a node process and the parallel combination technique to be
used will be selected. The tentative form of interprocessor communications is then
devised. A more definitive communications protocol is established in the following
chapter where the actual implementation process is described. The algorithm, the
search technique, and the interprocessor communications are used as a basis for
completing  the  implementation  of  the  parallel  assignment  algorithms. 

3.5.1  Selection  of  Search  Technique  All  three  search  techniques  presented 
have advantages and disadvantages. The branch and bound method employed by
Lai, Sahni, and Mraz exhibits some anomalous behavior related to problem size and
algorithmic overheads. The cure for this behavior was larger problem partition size,
which resulted in reduced parallelism. The alpha-beta search technique suffers from
sensitivity  to  the  amount  of  interprocessor  communications.  The  efficiency  of  the 
solution  space  search  is  inversely  proportional  to  the  amount  of  communications. 
Divide-and-conquer  performance  is  also  affected  by  the  problem  partition  size,  but 
the  effect  is  reduced  processor  efficiency  and  not  additional  algorithmic  overheads  as 
in the branch-and-bound method. The amount of required interprocessor communications
in the divide-and-conquer method is potentially the smallest of the three
search  techniques  because  the  individual  node  processes  are  relatively  independent, 
except  for  the  combining  of  subproblem  results.  Because  the  divide-and-conquer 
method  is  less  sensitive  to  problem  partition  size  and  interprocessor  communications 
than  the  other  search  techniques,  it  will  be  the  parallel  search  technique  employed 
in  the  implementations  developed  in  Chapter  4. 

3.5.2 Selection of Candidate Algorithm In Appendices A and B, the sequential
transportation and Hungarian methods for solving the assignment problem were
presented in detail. In this section, one of them is selected as a basis for the node
process program. The problem areas that are considered in the selection are the
algorithm’s complexity, partitionability, and expected level of interprocess
communications.

In an analysis and comparison of the computational complexity of simplex-based
algorithms and the Hungarian method, Bertsekas says a fully dense, all-integer,
$N \times N$ assignment problem solution using the Hungarian method is $O(N^3)$. He
further states that there is “no simplex type method with complexity as good as
$O(N^3)$” [Ber81]. Some rules used for selecting entering variables in the simplex
method have been shown to lead to exponentially long sequences of computations
[Hun83]. These statements would tend to lead one to choose the Hungarian method
over the simplex method. There are other factors, however, that must be considered
before deciding on the “best” method.

Several comparisons of simplex-based and primal-dual based (i.e., Hungarian)
methods of solving the assignment problem have been made [Hat75, GlK74, McG83].
There  is  general  agreement  that  an  efficient  implementation  of  the  Hungarian  method 
is better overall than using the simplex method. One reason for this is that the Hungarian
method is an algorithm developed and optimized especially for solving the
assignment  problem,  while  the  simplex  method  is  a  more  general  method  that  can  be 
used  to  solve  a  variety  of  linear  programming  problems  which  cannot  be  solved  by  the 
Hungarian  method.  Although  variants  of  the  simplex  method  have  been  developed 
for  the  assignment  problem,  they  still  encounter  difficulties  with  examining  nodes 
that  do  not  lead  to  the  optimum  solution  [Hat75,  BaG77,  Hun83].  Simplex  methods 
are  generally  more  suitable  for  the  transportation  problems  discussed  in  section  3.1 
where  the  number  of  non-degenerate  arcs  between  nodes  is  less  because  of  multiple 
resources and multiple requests by each requester. In a comparison of minimum-cost
network flow problem solutions, a specialized simplex-based code was shown to
outperform other codes, which included a primal-dual code, of which the Hungarian
method is a special case [GlK74]. But there, the main emphasis was on
transportation-type problems rather than assignment problems, where the simplex
method seems to perform worse.

Both  methods  use  similar  matrix  representations  of  the  initial  problem,  the 
intermediate results, and the final solution. The partitioning of the problem representations
of both methods is essentially the same because of the similar matrix
representation.  If  the  partitions  of  the  square  cost  matrix  are  in  the  form  of  square 
sub-matrices,  then  information  on  both  the  costs  of  assigning  to  each  requester  a 
particular  resource  and  the  cost  of  assigning  each  resource  to  a  certain  requester 
will be incomplete. However, if the cost matrix is partitioned into “strips,” each
processor will either have complete cost information on the assignment of a group of
resources to all requesters or a group of requesters to all resources. The availability
of complete cost information will have an effect on the optimality of the assignments
made and the amount of assignment coordination required by the individual node
processes. If the problem is partitioned into “strips,” both methods use exactly the
same  technique  of  “dummy”  variables  to  form  the  required  matrix  format  where  the 
number  of  resources  must  equal  the  number  of  requesters.  The  performance  of  the 
two  algorithms  will  be  affected  in  much  the  same  manner  by  the  partition  type.  The 
square  partition  appears  to  potentially  require  more  interprocess  communications 
than  the  rectangular  strip  partition. 
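The strip partitioning and dummy padding described above can be sketched as follows. This is a minimal illustration only: the function names, the contiguous row split, and the dummy cost of zero are assumptions made here and are not taken from the implementations in Chapter 4.

```python
def row_strips(cost, num_procs):
    """Split a cost matrix (a list of rows) into contiguous row strips.

    Strip sizes differ by at most one row, so the load is roughly even.
    Each strip keeps every column, i.e. complete cost information for
    its group of resources against all requesters.
    """
    base, extra = divmod(len(cost), num_procs)
    strips, start = [], 0
    for p in range(num_procs):
        size = base + (1 if p < extra else 0)
        strips.append(cost[start:start + size])
        start += size
    return strips


def pad_to_square(strip, dummy_cost=0):
    """Append dummy rows so a k x n strip (k <= n) becomes n x n.

    The dummy cost value is an assumption of this sketch.
    """
    n_cols = len(strip[0])
    padded = [row[:] for row in strip]
    while len(padded) < n_cols:
        padded.append([dummy_cost] * n_cols)
    return padded
```

For a 4 x 4 cost matrix split over two processors, each strip holds two complete rows, and padding adds two dummy rows to restore the square format.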

The  format  and  type  of  communications  that  will  be  used  between  processors 
are  discussed  in  Section  3.5.3  and  developed  in  Chapter  4.  However,  some  assessment 
needs  to  be  made  of  the  level  of  communications  that  might  be  required  by  the 
transportation  and  Hungarian  methods  in  order  to  develop  a  node  process  with 
minimal  communications.  The  volume  of  communications  will  depend  strongly  on 
the  partition  size  and  type  in  both  methods.  As  mentioned  in  the  previous  paragraph, 
the  strip  size  will  affect  the  amount  of  communications  required  to  either  obtain  cost 
information  or  coordinate  the  assignments.  Because  the  problem  representations  and 
solution  results  are  similar,  the  level  of  communications  is  expected  to  differ  very 
little  between  the  two  methods. 

Because  the  partitionability  and  the  communications  requirements  are  very 
similar  for  both  algorithms,  the  selection  for  use  in  the  parallel  algorithm  must  be 
based  on  some  other  criteria.  Fox  feels  that  the  processes  running  on  the  individual 
nodes  of  a  multiprocessor  should  essentially  perform  the  same  operations  as  the 
sequential  version  of  the  algorithm  [Fox84,  Fo084].  For  this  reason,  it  is  reasonable 
to  select  the  most  efficient  sequential  algorithm  for  parallelization,  provided  that  all 
other factors are nearly the same. The performance comparison by Bertsekas of the
Hungarian and simplex-based methods indicates that the Hungarian method is more
efficient. In Section 3.3, the complexity analysis showed that the Hungarian method
would require only 2/3 as many steps as the transportation method to solve the
same size problem (i.e., about 33% fewer steps). Based on
the previous discussion in this section and the complexity advantage, the Hungarian
method is selected as the basis for the node process program.

3.5.3 Interprocess Communications Protocol Based on the discussion in Chapter 2,
the development of a communications protocol that both minimizes the amount
of communications and permits the efficient transfer of required information between
processors is the most crucial aspect of developing a parallel software
implementation. The analysis of the divide-and-conquer technique pointed out that the
interprocessor communications were minimal in the computation phase, depending on
the type of problem being solved. One type of communication envisioned stems from
the fact that evaluating the cost of assigning a particular resource to a requester
will require knowledge of the assignment cost for all the requesters. The
communication of cost information could be eliminated by storing the needed
information in all processors. This concept is explored further in Chapter 4. At this
point in the development, it is not clear whether this method is feasible.

The assignment problem will also require some degree of communications between
processors so that the assignments made by other processes can be checked to
see if the same requester was assigned more than one resource. However, this
communication would occur only after an iteration of the assignment algorithm was
complete. If conflicts are present, then a form of bidding would need to take place in
which the lowest cost assignment to a requester would stand, and all processors that
assigned other resources to the same requester would need to compute another
assignment without considering the conflicting requester. This requires that, after
each assignment by a node processor, the individual assignments be broadcast to the
other processors to determine if any conflicts exist. If none exist, the
assignment stands. Otherwise, the bidding process would occur to resolve the
conflicting assignments. There are several unanswered questions about the exact form
of the interprocessor communications. But the concepts just presented should form
a basis that can be further refined in the implementation process that follows.

3.5.4 The Parallel Assignment Algorithms The general scheme of the parallel
assignment algorithms is based on divide-and-conquer as the high-level
parallel  search  strategy  and  the  sequential  Hungarian  method  as  the  node  process 
program.  Each  processor  in  the  Intel  hypercube  runs  a  version  of  the  sequential 
Hungarian  algorithm  that  has  been  modified  to  include  a  means  of  communicating 
with  other  processors  and  the  cube  manager.  The  exact  form  of  the  interprocessor 
communications is not fully defined at this point, but the majority of the communications
involve the exchange of information to resolve conflicts in assignments after
each  complete  iteration  of  the  Hungarian  algorithm  in  the  node  processors.  The 
implementations  in  Chapter  4  use  the  general  scheme  described  here  and  refine  it  as 
necessary. 
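The conflict resolution by bidding outlined in Section 3.5.3 can be sketched as follows. The sketch is hedged: the tuple layout, the function name, and the rule that the earlier proposal wins a tie are assumptions introduced here, since the exact protocol is left open at this stage.

```python
def resolve_conflicts(proposals):
    """Resolve requesters claimed by more than one processor.

    proposals: list of (processor, resource, requester, cost) tuples
    broadcast after an iteration of the node assignment algorithm.

    Returns (winners, losers). winners maps each requester to the
    lowest-cost proposal, which stands. losers lists the proposals
    whose processors must compute another assignment without
    considering the contested requester.
    """
    winners = {}
    losers = []
    for prop in proposals:
        requester, cost = prop[2], prop[3]
        best = winners.get(requester)
        if best is None:
            winners[requester] = prop
        elif cost < best[3]:
            # The new bid is cheaper: the previous holder loses.
            losers.append(best)
            winners[requester] = prop
        else:
            # Ties go to the earlier proposal (an assumption).
            losers.append(prop)
    return winners, losers
```

If two processors both claim requester 5 at costs 10 and 7, the cost-7 assignment stands and the other processor is directed to reassign its resource.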


3.6  Summary 

This chapter has covered the development of the parallel assignment algorithm,
beginning with a formal definition of the assignment problem. The major sequential
algorithms developed for solving the assignment problem were briefly described,
followed by a detailed presentation of the transportation and Hungarian algorithms.
The Hungarian and transportation methods were compared in terms of computational
complexity and suitability for parallelization. Then three methods of parallel
search of a problem solution space were explored. In the concluding section, the
Hungarian method was chosen as the basis for the node program to be developed in
Chapter 4. The divide-and-conquer technique was chosen as the high-level parallel
search method to be incorporated into the parallel assignment algorithm. Finally,
the groundwork for the communications protocol was established. Chapter 4
continues the process of developing the implementation of a parallel weapon-target
assignment algorithm on the Intel hypercube computer by utilizing the results of
this chapter in the design and coding of the software.

4. Implementation of the Assignment Algorithms


This chapter utilizes the background work done in the preceding chapters
to  develop  implementations  of  parallel  assignment  algorithms  for  the  Intel  iPSC 
parallel  computer.  First,  detailed  assumptions  are  presented  to  more  closely  define 
the  problem  space  being  considered,  followed  by  a  definition  of  the  experimental 
model.  A  brief  description  of  a  Ballistic  Missile  Defense  (BMD)  simulation  program 
[Odo85]  developed  for  the  U.S.  Army  and  other  government  agencies  is  then  given, 
followed  by  an  explanation  of  the  method  employed  to  generate  input  data  for  the 
programs  developed  in  this  research. 

Once the experimental model and the means to generate plausible input data have
been fully developed, the development of two sequential assignment implementations
is described. The purpose of the first sequential program, named the “sorting method,”
is to establish a baseline for comparison with all of the other sequential and parallel
assignment algorithms. The second sequential implementation, which is based on a
version of the Hungarian method developed by Bourgeois and Lassalle [BoL71a], is
known as the “sequential B&L algorithm.” The sequential B&L algorithm is also
used for comparison with the parallel implementations. Portions of it later become
an integral part of the parallel algorithms.

The  development  of  four  different  parallel  implementations  of  an  assignment 
algorithm  is  presented  next.  Each  parallel  implementation  uses  a  different  level  of 
interprocessor  communications  in  order  to  allow  a  study  of  its  effect  on  performance 
measures  such  as  computation  times,  speedup,  load  balancing,  and  assignment  costs. 
The description of each implementation states the objectives, outlines the development
approach, defines the software modules and interfaces, describes the organization
and major sections of the program, and estimates the computational complexity.
This  chapter  concludes  with  a  summary  of  the  implementations  developed. 


4.1 The Experimental Model


In Chapter 1, a number of assumptions were stated concerning the deployment
and  operation  of  the  BM/C3  battle  management  system.  This  section  expands  on 
those  assumptions  and  states  them  in  more  detail.  Then  the  battle  management 
portion  of  the  BM/C3  system  being  modeled  is  precisely  defined.  A  background 
and  brief  description  of  the  BMD  simulation  program  is  given  in  order  to  further 
define  the  scope  and  limitations  of  the  experimental  model.  This  section  concludes 
with  the  development  of  a  simpler  program  to  generate  data  similar  to  the  attack 
scenarios  generated  by  the  BMD  simulation  program. 

4.1.1 Assumptions The main point of the assumptions stated in Section 1.6
was that this study focuses on one specific task of the battle management function.
That task is the weapon-to-target assignment. All other functions related to the
management of the weapons and other resources are assumed to be handled by other
“modules” or components of the system. The optimal assignment of weapons cannot
be accomplished unless certain information from these other modules is available.
Information  such  as  the  range  to  the  target,  the  type  of  target,  the  weapon-to-target 
impact  angle,  the  expected  impact  area  of  the  target,  the  status  and  position  of  all 
weapons,  and  several  other  factors  are  all  needed  to  derive  the  “cost”  of  assigning 
each  weapon  to  each  potential  target.  The  data  collection  and  evaluation  activities 
required  to  derive  these  individual  assignment  costs  are  assumed  to  be  performed  by 
the other modules and made available to the assignment module of the battle management
system. The assignment module is further assumed to be memoryless. This
means  that  each  assignment  iteration  is  based  only  on  the  current  cost  information 
provided  to  it  and  is  unaffected  by  previous  assignments.  However,  the  assignment 
process  may  choose  to  allow  certain  weapons  to  remain  idle  for  future  use  if  the 
present  cost  of  utilization  is  considered  too  high. 


4.1.2 Model Definition The system model used in this research is not geared
towards any specific type of weapon. “Generic” weapons are assumed to be deployed
on space-based platforms orbiting the earth. These weapons are also assumed to be
“single shot” in the sense that during each assignment, each weapon can be assigned
to only one target. For weapons with multiple targeting capability, multiple
assignments would occur over several iterations of the assignment task, with other
modules accounting for factors such as slew rates and retargeting capabilities. Each
iteration of the assignment task is a single “snapshot” in the overall battle management
process that would occur in the interception of ballistic missiles. This model is not
intended to account for all factors involved with the BM/C3 task, but rather to
address the major issues that affect the critical assignment task.

4.1.3 The Ballistic Missile Defense Simulation Program Several simulation
programs have been developed in the past to model the development and deployment
of ballistic missile defense systems [Odo85, Cur87]. One recent simulation program
is  the  result  of  work  performed  under  contract  to  the  Defense  Advanced  Research 
Projects  Agency  (DARPA)  and  the  U.S.  Army.  DESE  Research  and  Engineering 
was  tasked  to  develop  a  software  and  graphics  package  to  aid  in  the  research  and 
development of BMD and Anti-SATellite (ASAT) programs [Odo85]. The main
objective  of  this  project  was  to  utilize  interactive  graphics  to  aid  in  assessing  the 
performance  of  proposed  scenarios  and  weapon  deployments.  A  FORTRAN-based 
testbed  program  was  written  to  model  the  significant  physical  parameters  of  the 
problem  and  generate  data  to  be  used  in  developing  the  graphics  driver  program 
written  in  a  version  of  the  FORTH  language.  The  testbed  program  combined  the 
results  of  earlier  research  and  provided  a  first  order  simulation  of  the  events  expected 
to  occur  in  a  full-scale  global  engagement.  The  primary  weapon  system  modeled  in 
this  simulation  program  was  a  combined  ground-based  laser  and  space-based  relay 
mirror  arrangement.  The  engagement  scenarios  generated  were  plausible  because 
the enemy missile trajectories were calculated as originating from known missile
sites in the Soviet Union and terminating in major cities and military complexes in
the United States. The altitudes and orbits of the relay mirrors were also simulated,
and the visibility of each mirror to the ground-based lasers was determined using
three-dimensional coordinates, the rotation of the earth, orbital mechanics, weather
conditions, and several other factors.

4.1.4 Method of Input Data Generation The BMD simulation program briefly
described above focused on one particular type of weapon system. One component
of that weapon system was a constellation of space-based relay mirrors. Certain
subroutines of the program calculate the distance between a relay mirror and a target,
and the incident angle of a laser directed at a particular target. These two pieces of
information can also be used as basic parameters for an entirely space-based weapon
system. However, the BMD program is not capable of producing more than 20
feasible weapon-to-target “links” per snapshot. This is much too low to be useful because
the assignment algorithms studied in Chapter 3 would treat this small number of
weapons as a trivial case.

The  distance  and  angle  data  are  still  very  useful,  even  though  the  quantity 
of  data  is  insufficient.  The  testbed  program  was  modified  to  store  this  information 
during  the  execution  of  a  typical  full-scale  attack  simulation.  Representative  values 
of  the  distances  and  angles  were  used  as  a  guide  to  develop  similar  and  more  extensive 
data  by  means  of  a  much  simpler  program.  One  heuristic  used  in  the  BMD  simulation 
program  to  indicate  a  potentially  good  assignment  was  the  weapon-to-target  distance 
multiplied  by  the  cosine  of  the  impact  angle.  This  produces  low  values  for  impact 
angles  close  to  90  degrees  and  high  values  for  angles  near  zero.  For  minimum  cost 
assignments,  this  heuristic  is  expected  to  work  reasonably  well  for  other  types  of 
directed  energy  weapons  that  are  likely  to  be  deployed. 
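The distance-times-cosine heuristic can be written out as a short sketch. The function name and the use of degrees are assumptions made for illustration; the units of distance are not stated in the text.

```python
import math


def heuristic_cost(distance, impact_angle_deg):
    """Weapon-to-target distance multiplied by the cosine of the impact angle.

    Angles near 90 degrees give values near zero (low cost); angles
    near zero give values close to the full distance (high cost).
    """
    return distance * math.cos(math.radians(impact_angle_deg))
```

For a fixed distance of 1000, the value falls from 1000 at an impact angle of zero to about 500 at 60 degrees and approaches zero at 90 degrees.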

The data generation program developed in this research uses a random number
generator to produce values in the range of 1 to 2500, which corresponds to the range
of values for the heuristic just described. Although the data is somewhat random,
provisions were made to produce lower values in some sections of the cost matrix
and higher values in others. The lower values correspond to a high probability of kill
and low cost assignments. The higher values indicate low probability of kill and high
cost assignments (i.e., long distances, small angles of impact). The groupings of low
and high values are intended to represent groups of weapons that have similar
opportunities for engaging the same targets. One additional factor that can be accounted
for is the number of reentry vehicles (RVs) contained in a particular booster-phase
target. An individual cost value can be lowered further by a factor of the number of
RVs, which makes that particular target more likely to be engaged by the assignment
algorithm. 
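A data generator of the kind just described might look like the following sketch. The checkerboard arrangement of low and high blocks, the block size, the split of the range at 1250, and the integer division by the RV count are all assumptions made here for illustration; the text does not specify how the groupings were laid out.

```python
import random


def generate_costs(num_weapons, num_targets, block=4, rv_counts=None, seed=0):
    """Build a weapons x targets cost matrix with values from 1 to 2500.

    Blocks of the matrix alternate between a low range (1..1250) and a
    high range (1251..2500) to mimic groups of weapons with similar
    engagement opportunities. If rv_counts is given, each cost in
    column t is divided by rv_counts[t] (hypothetical RV weighting).
    """
    rng = random.Random(seed)
    cost = []
    for w in range(num_weapons):
        row = []
        for t in range(num_targets):
            favourable = ((w // block) + (t // block)) % 2 == 0
            low, high = (1, 1250) if favourable else (1251, 2500)
            value = rng.randint(low, high)
            if rv_counts is not None:
                value = max(1, value // rv_counts[t])
            row.append(value)
        cost.append(row)
    return cost
```

A fixed seed keeps the generated scenarios repeatable, which is convenient when the same input is to be given to several assignment programs.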

4.2  Sequential  Assignment  Algorithm  Implementations 

Two sequential assignment implementations are described in this section. The
first one, called the sorting method, is typical of the methods used in battle
management simulation programs to assign weapons to targets and does not provide
the optimal assignment [Odo85, Cur87]. The second one, called the sequential B&L
algorithm, is based on a version of the Hungarian method developed by Bourgeois
and Lassalle, which does provide the minimum-cost, optimal assignment of available
weapons.

4.2.1 The Sorting Method The purpose of this assignment implementation
is  to  provide  a  baseline  that  is  relatively  easy  to  implement  and  can  be  used  for 
comparison  with  more  efficient  assignment  problem  solutions.  It  illustrates  how  a 
commonly  used,  simple  approach  to  the  assignment  problem  solution  can  be  very 
time  consuming.  The  basic  approach  to  this  program,  as  the  name  implies,  is  sorting. 
The input data generated by the program described in Section 4.1.4 is normally
stored as a rating matrix. For this application, the data is reordered into a list or
one-dimensional array format to allow the cost values to be sorted in ascending order.
The row and column values associated with each cost entry are accessible through
corresponding arrays. Once the list is sorted, the lowest value is selected and the
associated weapon and target (row and column) are marked as “assigned.” Then
the  next  lowest  cost  in  the  list  is  selected  and  if  both  the  weapon  and  the  target 
associated  with  this  value  are  also  unassigned,  the  assignment  is  made.  However,  if 
either  the  weapon  or  the  target  is  already  assigned,  the  next  lowest  value  in  the  list 
is  examined  for  possible  assignment  and  so  on.  This  process  continues  until  all  the 
weapons  are  assigned. 
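The selection procedure just described can be sketched as follows. This is an illustration, not the program developed in this research: it uses Python's built-in sort in place of the Shell sort, and the function name and return format are assumptions.

```python
def sorting_method(cost):
    """Greedy assignment from a sorted list of (cost, weapon, target).

    cost is a matrix with one row per weapon and one column per target.
    Returns (assignment, total) where assignment maps weapon -> target.
    """
    entries = sorted(
        (cost[w][t], w, t)
        for w in range(len(cost))
        for t in range(len(cost[0]))
    )
    limit = min(len(cost), len(cost[0]))
    weapons_used, targets_used = set(), set()
    assignment, total = {}, 0
    for value, w, t in entries:
        # Skip entries whose weapon or target is already assigned.
        if w in weapons_used or t in targets_used:
            continue
        assignment[w] = t
        weapons_used.add(w)
        targets_used.add(t)
        total += value
        if len(assignment) == limit:
            break
    return assignment, total
```

Each entry of the sorted list is examined at most once, so the selection pass itself is linear in the length of the list once sorting is complete.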

At first, this method appears to offer the lowest possible overall assignment
cost. However, this is not the case. In many cases, selecting the lowest remaining
cost for a particular weapon-target pair eliminates the possibility of using
the same weapon or target in a later assignment that, although some of the
individual assignment costs may be higher, would produce a lower overall cost. The
results of the example problems in Appendices A and B illustrate this point. If an
assignment was made using the same cost data given in the example with the sorting
method, the overall cost could have ranged from the optimum on up to a value of
15,  depending  on  how  the  list  was  sorted. 
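A small worked case makes the same point. The costs below are chosen here purely for illustration and are not the data of Appendices A and B. Selecting the lowest entry first forces the most expensive entry to be used later:

```latex
C = \begin{pmatrix} 1 & 2 \\ 2 & 100 \end{pmatrix}
\qquad
\begin{aligned}
\text{sorting method:} \quad & c_{11} + c_{22} = 1 + 100 = 101 \\
\text{optimal:} \quad & c_{12} + c_{21} = 2 + 2 = 4
\end{aligned}
```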

The  program  implementing  this  algorithm  is  actually  split  into  two  portions. 
The  first  portion  is  hosted  on  the  cube  manager  of  the  Intel  iPSC.  Its  function  is 
to  prompt  the  user  for  problem  size  information,  access  the  file  containing  the  input 
data,  and  compile  post-run  statistics.  The  other  portion  of  the  program  is  hosted  on 
one  of  the  node  processors  of  the  iPSC.  The  cube  manager  or  host  program  sends 
the  problem  size  parameters  and  the  cost  data  to  the  node  program.  The  node 
program sorts the cost data using a relatively quick Shell sort [KeR78] and performs
the  process  previously  described  to  make  the  assignments.  Once  the  assignments 
are  complete,  the  node  program  sends  the  assignments  list  and  timing  information 
to  the  host  program  for  further  processing  and  display.  The  performance  results  of 
this  implementation  are  presented  and  analyzed  in  Chapter  5. 
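A Shell sort that keeps the row and column arrays aligned with the cost list can be sketched as below. The gap-halving sequence and the function name are assumptions of this sketch; it is not a transcription of the node program.

```python
def shell_sort_costs(costs, rows, cols):
    """Sort costs ascending in place, moving rows/cols in step with them.

    After sorting, costs[i] is still the cost of assigning weapon
    rows[i] to target cols[i].
    """
    n = len(costs)
    gap = n // 2
    while gap > 0:
        # Gapped insertion sort for the current gap.
        for i in range(gap, n):
            c, r, k = costs[i], rows[i], cols[i]
            j = i
            while j >= gap and costs[j - gap] > c:
                costs[j] = costs[j - gap]
                rows[j] = rows[j - gap]
                cols[j] = cols[j - gap]
                j -= gap
            costs[j], rows[j], cols[j] = c, r, k
        gap //= 2
```

Sorting the three parallel arrays together avoids a separate lookup from each sorted cost back to its position in the original matrix.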

The basic limiting factor of this method is the sorting of the cost list. Some
simulation programs use heuristic methods to reduce the size of this list in order to
shorten the time required to sort the list. The Shell sort is an $O(N \times \sqrt{N})$
procedure [Fel85]. When the assignment process is performed on the sorted list, it
will be at least $O(N)$ because $N$ weapons need to be assigned. In the worst case,
$N^2$ operations will be required to make the assignments using the sorted list.
Overall, this sorting method of solving the assignment problem appears to be
$O(N^{3.5})$.

4.2.2 The Bourgeois and Lassalle Algorithm The Hungarian method, selected
in Chapter 3 as the basis for the node process of the parallel assignment
algorithm, can be found in many forms in the literature. The Bourgeois and Lassalle
(B&L) algorithm is one variation of the Hungarian method [BoL71b]. It is chosen
for implementation because it handles the case of non-square cost matrices without
the addition of dummy variables mentioned in the presentation of the Hungarian
method. In a realistic scenario, there will be many more targets than there are
weapons. The basic operations of the B&L algorithm are the same as those
illustrated in the example problem in Appendix B with the addition of some pointer
arrays to keep track of certain assigned weapons and targets.

The basic approach to solving the assignment problem in this implementation
is to use the B&L algorithm as a function call within the same basic framework
as the sorting method program. The problem size parameters and input cost data
are handled by a cube manager host program. The actual assignment is performed
by a node program and the results are sent back to the host program for post-run
processing. The operations required to formulate the optimal assignment are
contained within the B&L algorithm and are illustrated in the example problem of
Appendix B. The cost matrix, the number of weapons, and the number of targets
are all supplied as parameters to the assignment function call. The function returns
an array indexed from one to the number of available weapons and the total cost of
the assignment. Each entry in the array is the target number to which the weapon
(the array index number) is assigned.

This  program  will  provide  a  minimum-cost,  optimal  assignment.  The  overall 
cost  of  the  assignment  is  considered  when  each  individual  assignment  is  made,  unlike 
the sorting method where only the individual costs are considered. The use of
the matrix representation and several arrays to keep track of potential and actual
assignments of weapons and targets allows the optimal assignment to be derived. The
performance  improvement  of  this  program  over  the  sorting  method  is  presented  in 
Chapter  5. 
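The calling interface of the assignment function, and the meaning of a minimum-cost assignment, can be illustrated with an exhaustive search. This sketch is emphatically not the B&L algorithm: it enumerates every possible assignment and is practical only for very small matrices. The function name is hypothetical, and indices are zero-based here rather than starting from one.

```python
from itertools import permutations


def optimal_assignment(cost, num_weapons, num_targets):
    """Exhaustive minimum-cost assignment for num_weapons <= num_targets.

    Returns (targets, total): targets[w] is the target assigned to
    weapon w, and total is the overall cost of the assignment.
    """
    best_targets, best_total = None, None
    for targets in permutations(range(num_targets), num_weapons):
        total = sum(cost[w][targets[w]] for w in range(num_weapons))
        if best_total is None or total < best_total:
            best_targets, best_total = list(targets), total
    return best_targets, best_total
```

Because the whole assignment is scored at once, a choice that looks expensive for one weapon can still be selected when it lowers the overall total.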

The complexity of the Hungarian method was analyzed in Section 3.3.2. The
complexity of the B&L algorithm is somewhat worse for square matrices because of
the leading $n^3$ term’s coefficient. For nonsquare matrices, the complexity is slightly
improved. Fewer operations are required because, instead of searching and subtracting
both column and row minimums, either row or column minimums are searched
for and subtracted. In the nonsquare case with $m$ rows and $n$ columns where $m < n$,
$m + mn$ operations are required to locate and subtract minimum values compared
to the $2(n + n^2)$ operations in the pure Hungarian method where the matrix must be
squared with dummy elements. Other operations in the B&L algorithm are reduced
by similar factors because the extra dummy rows or columns are not needed.
Overall, the total number of operations is significantly less in the worst case where each
iteration adds only one additional assignment. Instead of $n - 1$ complete iterations,
only $m - 1$ iterations are required. Performing a complexity analysis similar to the
one in Section 3.3.2, the complexity estimate for the version of the B&L algorithm
developed here is:

$$\text{operations} = 3nm^2 + 4m^2 - 2mn - m - 1 \qquad (4\text{-}1)$$

This  complexity  estimate  will  be  later  used  in  assessing  the  complexity  of  the 
parallel  versions  of  the  assignment  algorithm. 
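The estimate can be evaluated numerically with a short sketch. It assumes that Equation 4-1 reads operations = 3nm^2 + 4m^2 - 2mn - m - 1 and uses the minimum-location counts quoted in the text; the function names are hypothetical.

```python
def bl_operations(m, n):
    """Worst-case operation estimate of Equation 4-1 for an m x n matrix."""
    return 3 * n * m**2 + 4 * m**2 - 2 * m * n - m - 1


def min_subtract_ops_bl(m, n):
    """Operations to locate and subtract minimums in the nonsquare case."""
    return m + m * n


def min_subtract_ops_padded(n):
    """Same step when the matrix is first squared with dummy elements."""
    return 2 * (n + n * n)
```

With 8 weapons and 32 targets, for example, the minimum-subtraction step takes 264 operations instead of 2112, which shows how strongly the estimate depends on the smaller dimension.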


4.3 Parallel Assignment Algorithm Implementations


In this section, four different parallel implementations of the assignment
algorithm are presented. The first three programs are closely related and vary mainly in
the amount of interprocessor coordination. Although all of the first three programs
use the B&L algorithm function developed in Section 4.2 to perform the assignment
task in parallel, the final assignments produced are not optimal. Heuristics are used
to reduce the number of redundant assignments and will be fully discussed in the
following sections. The fourth program is an effort to implement a parallel version
of the B&L algorithm whose final solution is optimal.

As  studied  in  Chapter  2,  the  two  major  areas  to  be  concerned  with  when 
developing  parallel  programs  are  the  interprocessor  communications  and  the  problem 
partition  size.  The  effects  of  different  levels  of  interprocessor  communications  can 
be  studied  by  comparing  the  performance  of  these  first  three  programs.  The  type 
of  matrix  partitioning  discussed  in  Section  3.5.2  was  the  strip  method  where  entire 
rows  of  the  matrix  are  transferred  to  the  different  processors.  By  partitioning  in 
this  method,  each  processor  is  responsible  for  a  unique  group  of  weapons.  Complete 
cost  information  for  assigning  any  of  its  weapons  is  available  without  initiating  any 
communications  with  neighboring  processors.  Other  forms  of  communications  that 
are  necessary  will  be  discussed  in  the  description  of  each  program. 

4.3.1  The First Level: No Communications  The “first level” or level 1 parallel
program is the case where there is no coordination between any of the processors
in the iPSC. Each node processor works entirely independently of the other
processors. Figure 4-1 illustrates the relationship between the individual
processors in the cube and the cube manager. With a 5-dimension hypercube, up to 32
processors can operate in parallel on different portions of the cost matrix. The
execution time is expected to be much shorter than that of either sequential
implementation; however, the resulting assignment will not be optimal.

Figure 4-1.  Processor Communication Paths for the First Level

The  non-optimal  assignment  solution  of  this  implementation  results  from  the 
individual  processors  not  communicating  with  each  other  about  which  targets  have 
been  assigned.  As  a  consequence,  one  processor  may  assign  a  weapon  to  a  certain 
target  while  another  processor  may  assign  a  different  weapon  to  the  same  target. 
This  wastes  one  weapon  that  could  have  been  assigned  to  another  target.  A  larger 
number  of  processors  will  most  likely  result  in  more  redundant  assignments  and 
more  wasted  weapons,  but  will  yield  these  results  much  faster  than  could  a  single 
processor  implementation.  The  performance  evaluations  in  Chapter  5  address  both 
the  problem  of  redundancies  and  the  tradeoffs  between  the  speed  of  execution  and 
the  optimality  of  assignment. 
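
As a rough illustration of how these redundancies arise, the following Python sketch solves each strip of a small cost matrix independently and counts the wasted weapons. The exhaustive search in `solve_strip` is only a stand-in for the B&L routine, and the function names and the 4 x 4 cost matrix are hypothetical; none of this is taken from the thesis programs.

```python
from collections import Counter
from itertools import permutations


def solve_strip(cost, rows):
    """Min-cost assignment of the weapons in `rows` to distinct targets.

    Exhaustive search, usable only for tiny examples (stand-in for B&L).
    """
    n_targets = len(cost[0])
    best = min(
        permutations(range(n_targets), len(rows)),
        key=lambda targets: sum(cost[w][t] for w, t in zip(rows, targets)),
    )
    return dict(zip(rows, best))


def level1(cost, strips):
    """Solve every strip with no knowledge of the other strips' choices."""
    assignment = {}
    for rows in strips:  # on the hypercube, each strip would run on its own node
        assignment.update(solve_strip(cost, rows))
    return assignment


def wasted_weapons(assignment):
    """Number of weapons aimed at a target that another weapon already covers."""
    counts = Counter(assignment.values())
    return sum(c - 1 for c in counts.values())


# Hypothetical cost matrix: rows are weapons, columns are targets.
cost = [
    [1, 9, 9, 9],
    [9, 2, 9, 9],
    [2, 9, 9, 9],
    [9, 9, 9, 3],
]
pairs = level1(cost, [[0, 1], [2, 3]])
print(pairs, wasted_weapons(pairs))
```

In this example weapons 0 and 2 sit in different strips and both select target 0, so one weapon is wasted even though each strip's own assignment is optimal.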

This  parallel  implementation  is  simply  an  extension  of  the  sequential  version 
developed  in  Section  4.2.2.  An  identical  node  process  is  loaded  into  all  of  the  pro¬ 
cessors  to  be  utilized.  The  host  program  prompts  for  problem  size  input  and  reads 
the  input  cost  data  from  an  external  file.  But  in  this  case,  the  cost  matrix  must  be 
partitioned  among  the  multiple  processors.  There  are  two  situations  that  must  be 
handled. One is where the rows of the cost matrix divide evenly among the node
processors and the other is where they do not. In the uneven case, one processor
will receive a different number of rows than the others. This will slightly affect the load balancing of
the  processors,  but  is  not  expected  to  be  a  significant  problem.  Once  each  processor 
completes  its  independent  assignment,  it  waits  until  prompted  by  the  host  to  return 
the  assignment  results  and  timing  information. 
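
The row partitioning can be sketched as follows. This is a minimal Python illustration, not the host program itself: the name `partition_rows` is hypothetical, and spreading the leftover rows one at a time over the first few processors is an assumption made here for simplicity (the text above indicates that a single processor absorbs the difference).

```python
def partition_rows(num_rows, num_procs):
    """Split row indices 0..num_rows-1 into contiguous strips, one per processor.

    Assumption: when the rows do not divide evenly, the first
    (num_rows % num_procs) processors each take one extra row.
    """
    base, extra = divmod(num_rows, num_procs)
    strips = []
    start = 0
    for rank in range(num_procs):
        size = base + (1 if rank < extra else 0)
        strips.append(list(range(start, start + size)))
        start += size
    return strips


print(partition_rows(8, 4))   # even case: every strip has 2 rows
print(partition_rows(10, 4))  # uneven case: strip sizes 3, 3, 2, 2
```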

In this implementation, the B&L algorithm is essentially running in each of
the cube processors. The complexity of this B&L algorithm has already been
estimated in Section 4.2.2. If the operations required to transfer the cost matrix
data to the individual nodes are ignored, the complexity estimate can be derived by
simply dividing the sequential B&L algorithm complexity estimate by the number of
processors being used. The accuracy of this estimate is tested when the actual
performance data is analyzed in the following chapter.

4.3.2  The Second Level: Partial Communications, Single Iteration  This “second
level” or level 2 parallel implementation introduces some coordination between
the processors computing assignments for certain partitions of the cost matrix. The
coordination is performed by processors designated as controller processors. The
processors performing the assignments are known as assign processors. A possible
processor arrangement for two partitions is illustrated in Figure 4-2. For this
study, the number of controllers available is 2, 4, or 8. With 2 controllers, up to
15 assign processors may be used per controller. For 4 controllers, up to 7 assign
processors and for 8 controllers either 2 or 3 assign processors per controller may be
utilized. The level 2 implementation described in this section and the “third level” or
level 3 program discussed in the following section both use the same basic processor
arrangement.

The host program performs essentially the same function as in the first level
approach. The only difference is that the host communicates with the controller
processors rather than the assign processors. The partitions of the cost matrix sent
to the controller processors are further subdivided by the controllers and sent to the
appropriate assign processors. The assign processors have no direct communication
with the host except when a global START command is issued from the host to signal
the completion of all data distribution functions and the beginning of the actual
processing. The assignments or weapon-target pairings from the assign processors
are examined by their associated controllers. The controllers eliminate redundancies
in the weapon-target pairings by comparing the individual costs of those that are
conflicting. The controllers allow the lowest cost weapon allocations to remain and
set all the higher cost, redundantly assigned weapons to an idle state. The results
and timing information are sent to the host and processed as previously described.

Figure 4-2.  Communication Paths for the Second and Third Levels
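
The controller's arbitration step can be sketched as below. This is a minimal Python illustration under stated assumptions: the name `arbitrate`, the use of `None` to mark an idle weapon, and the tie-breaking rule (the first weapon seen keeps the target) are all hypothetical choices, not details of the thesis code.

```python
def arbitrate(assignment, cost):
    """Resolve conflicts in a weapon -> target mapping.

    For each contested target the lowest-cost weapon keeps it; every other
    weapon aimed at that target is idled (mapped to None).
    """
    best = {}  # target -> weapon currently holding it
    for weapon, target in assignment.items():
        holder = best.get(target)
        if holder is None or cost[weapon][target] < cost[holder][target]:
            best[target] = weapon
    return {w: (t if best[t] == w else None) for w, t in assignment.items()}


# Hypothetical data: weapons 0 and 2 both selected target 0.
cost = [
    [1, 9, 9, 9],
    [9, 2, 9, 9],
    [2, 9, 9, 9],
    [9, 9, 9, 3],
]
print(arbitrate({0: 0, 1: 1, 2: 0, 3: 3}, cost))
```

Weapon 0 reaches target 0 at cost 1 and weapon 2 at cost 2, so weapon 2 is the one idled.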

The  results  of  this  implementation  should  show  an  improvement  over  the  first 
level  approach.  Fewer  redundancies  and  lower  costs  are  some  of  the  expected  benefits. 
The  coordination  requires  extra  computations  that  may  degrade  performance  if  a 
large  number  of  redundancies  occur.  The  final  weapon-target  pairings  for  a  two 
controller configuration should be similar to the first level implementation using two

The controller processors add a significant number of operations. For each
controller, the individual assignments must be received from the assign processors
and a master assignments list compiled. Then the list must be searched to identify
any redundancies. A table lookup must be accomplished for each individual
assignment to determine the lowest cost in the case of conflicts and to derive the
overall assignment cost. The operations required for the assignment list compilation
depend on the partition size. A large number of controllers allows more of the
operations to be done in parallel. The worst case for conflicts would be where every
assignment from one assign processor conflicts with an assignment from another
processor. The additional load from the controller is estimated to be:


$$\text{operations} = \frac{n^2}{p} + 2\,\frac{n}{p} + \frac{n}{2p} + \frac{n}{p} = \frac{n^2 + 3.5n}{p} \tag{4-2}$$

where

n is the total number of weapons
and p is the number of controllers

This estimate is in addition to the complexity estimate for the level 1
implementation. The complete complexity estimate is shown in Equation 4-3.

$$\text{operations} = \frac{3nm^2 + 4m^2 + n^2 - 2nm + 3.5n - m - 1}{p} \tag{4-3}$$


4.3.3  The Third Level: Partial Communications, Multiple Iterations  The
“third level” or level 3 parallel implementation increases the amount of
coordination performed in the controller processors. The controller and assign
processors are utilized in the same configuration as the second level approach,
illustrated in Figure 4-2. Instead of idling the redundantly assigned weapons as in
level 2, these weapons are made available for assignment to other targets not yet
assigned.

The cost matrix is partitioned exactly as in the level 2 implementation. Each
group of assign processors reports to one specific controller processor. The
controllers receive assignments computed on partitions of the cost matrix from their
assign processors. Each controller compiles a master assignments list. The
redundancies are then determined and the lowest cost individual assignments are
allowed to stand. The weapons involved in higher cost redundant assignments are
entered into one list and all targets that are not assigned are entered into another
list. Each controller then broadcasts its lists to all assign processors under its
control. New sets of assignments are computed and sent back to the controllers, which
again coordinate the removal of any new redundancies. This process continues until
all weapons have been assigned to a different target and all redundancies within the
partitions have been eliminated. Each controller then sends its final master
assignment list back to the host where it is compiled into a final assignment.
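
The iteration inside a single controller's partition can be sketched as follows. This Python example is only an interpretation of the description above: the exhaustive `solve` function stands in for the B&L routine, each assign processor is assumed to reassign only its own displaced weapons among the still-unassigned targets, and all names and the cost matrix are hypothetical.

```python
from itertools import permutations


def solve(cost, weapons, targets):
    """Exhaustive min-cost assignment of `weapons` to distinct `targets`."""
    best = min(
        permutations(targets, len(weapons)),
        key=lambda ts: sum(cost[w][t] for w, t in zip(weapons, ts)),
    )
    return dict(zip(weapons, best))


def level3_controller(cost, groups):
    """Iterate until every weapon in this partition holds a distinct target.

    `groups` lists the weapons owned by each assign processor.
    Assumes the partition has at least as many targets as weapons.
    """
    final = {}                               # confirmed weapon -> target
    pending = [list(g) for g in groups]      # weapons still to be placed
    free = list(range(len(cost[0])))         # targets not yet assigned
    iterations = 0
    while any(pending):
        iterations += 1
        proposals = {}
        for weapons in pending:              # one independent solve per assign processor
            if weapons:
                proposals.update(solve(cost, weapons, free))
        winner = {}                          # target -> lowest-cost proposing weapon
        for w, t in proposals.items():
            if t not in winner or cost[w][t] < cost[winner[t]][t]:
                winner[t] = w
        for t, w in winner.items():
            final[w] = t
        free = [t for t in free if t not in winner]
        pending = [[w for w in ws if w not in final] for ws in pending]
    return final, iterations


cost = [
    [1, 9, 9, 9],
    [9, 2, 9, 9],
    [2, 9, 9, 9],
    [9, 9, 9, 3],
]
print(level3_controller(cost, [[0, 1], [2, 3]]))
```

Here the first pass leaves weapon 2 displaced from target 0, and the second pass moves it to the only remaining target instead of idling it as level 2 would.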

The  final  assignment  from  this  implementation  will  also  not  be  optimal.  There 
may  be  some  redundancies  resulting  from  the  assignments  made  in  different  con¬ 
troller  partitions  because  there  is  no  coordination  between  the  controllers.  However, 
there  will  be  no  idle  weapons  due  to  the  multiple  iterations  performed  to  eliminate 
the  redundant  assignments  within  each  controller's  partition.  The  cost  of  the  final 
assignment  will  tend  to  be  higher  than  the  optimal  assignment  for  several  reasons. 
When redundancies occur within a controller’s partition, at least one of the final
assignments made by the assign processors will not be optimal because the alternative
weapon-target pairings always have an equal or higher cost. Although redundancies are
eliminated within each controller’s partition, other redundancies can still exist
between different controllers.

The additional operations required by the controller processors are similar to
the second level implementation. However, the multiple iterations required to
eliminate the redundant assignments within each controller’s partition are an
additional source of computational overhead. In the worst case, each iteration would
assign only one of the available weapons for each assign processor. This would
require $n/(pq)$ iterations, where n is the total number of weapons, p is the number
of partitions or controllers, and q is the number of assign processors per
controller. Multiplying Equation 4-2 by the number of iterations and adding the
result to Equation 4-3 yields the following expression for the complexity estimate
of this level 3 implementation:


$$\text{operations} = \frac{3nm^2 + 4m^2 + n^2 - 2nm + 3.5n - m - 1}{p} + \frac{n^3 + 3.5n^2}{qp^2} \tag{4-4}$$
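
As a check on the second term of Equation 4-4, the controller overhead of Equation 4-2 can be multiplied by the worst-case iteration count stated above. The intermediate step is written out here for clarity; it is a restatement, not an additional result.

```latex
\underbrace{\frac{n}{pq}}_{\text{iterations}}
\times
\underbrace{\frac{n^2 + 3.5n}{p}}_{\text{Equation 4-2}}
= \frac{n\,(n^2 + 3.5n)}{q\,p^2}
= \frac{n^3 + 3.5n^2}{q\,p^2}
```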

The coordination process is very expensive in terms of the number of operations
required. In the worst case, it is of approximately the same order as the B&L
algorithm itself. Although each iteration of the controller process requires another
iteration of the B&L algorithm, the B&L algorithm is performed on subproblems
of successively smaller dimensions. The dominant factor is the controller process
because it requires the same number of operations on each iteration.

4.3.4  The Fourth Level: Parallel Matrix Operations  The “fourth level” or
level 4 implementation is a different approach from the first three parallel
implementations. The program development involved studying the different operations
required by the sequential B&L algorithm and identifying the operations that were
the most time consuming. Then, certain operations were implemented in parallel on
multiple processors.

The most time consuming operations of the algorithm were located using the
timing function of the iPSC on different segments of the sequential B&L algorithm
implementation described in Section 4.2.2. Several different cost matrix sizes and
weapon-to-target ratios were used to determine the algorithm’s performance
characteristics. Three distinct segments of the sequential B&L algorithm were
identified as consuming more than 75% of the processing time. Not unexpectedly, these
code segments involved operations carried out on large portions of the cost matrix.
Of these three code segments, one dominated the processing time when the
weapon-to-target ratio was greater than or equal to 1:5. Because the weapon-to-target
ratio in a realistic scenario is expected to be at least 1:5 [AdW85], this particular
segment of the sequential B&L algorithm was chosen as the prime candidate for
parallelization. The operations performed in this segment search for minimum row
and column values, subtract the minimum values from the entire matrix, and locate
the resulting independent zero elements. These operations are analogous to the first
three steps of the Hungarian method description found in Appendix B. The other
time-consuming code segments were not chosen for parallelization because of the
higher amount of message passing that would be necessary to update various global
arrays used to coordinate the refinement of the initial assignment solution.

The  implementation  of  this  program  was  divided  into  three  portions.  The  usual 
host  program  performs  the  functions  described  in  the  previous  implementations. 
There  are  two  different  node  programs.  One  is  known  as  the  serial  process  and 
it  performs  the  serial  tasks  of  the  B&L  algorithm.  The  other  node  program  is 
the  parallel  process.  The  multiple  parallel  processes  are  subordinate  to  the  serial 
process and perform the operations identified as time consuming in the preceding
paragraph.  Figure  4-3  illustrates  the  communication  paths  between  processors  in 
this  implementation. 

Each parallel process operates on a particular “strip” of the cost matrix.
Initially, the minimum value in each row is determined and then this minimum value is
subtracted from each element in that row. These row operations can be performed in
parallel without any interprocessor communications. However, in the case of square
matrices, the minimum elements in each column must also be determined and then
subtracted. Because the row subtractions are performed on horizontal strips, no
processor will contain a completely modified column. This requires that each
processor search a portion of each column for minimum elements. The overall minimum
element of each column is determined from the individual processor contributions by
using a global operation function. The overall column minimum is then broadcast to
all processors for subtraction from their segment of the column. After all minimum
elements have been subtracted, the independent zero elements are determined
and used to make the initial assignment of weapons to targets.
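
The strip-wise reduction described above can be sketched in Python as follows. The function names are hypothetical, the global operation is modeled by an ordinary function call rather than iPSC message passing, and the column step is applied unconditionally here for illustration even though the text applies it to the square case. Each strip is assumed to be nonempty.

```python
def reduce_strip_rows(strip):
    """Subtract each row's minimum from that row (no communication needed)."""
    return [[x - min(row) for x in row] for row in strip]


def local_col_mins(strip):
    """Column minimums over the rows held by one processor."""
    return [min(col) for col in zip(*strip)]


def global_min(contributions):
    """Combine per-processor column minimums into overall column minimums."""
    return [min(vals) for vals in zip(*contributions)]


def reduce_matrix(cost, strips):
    """Row-reduce each strip, then column-reduce using the combined minimums."""
    reduced = [reduce_strip_rows([cost[r] for r in rows]) for rows in strips]
    col_min = global_min([local_col_mins(s) for s in reduced])  # "broadcast" value
    return [[x - m for x, m in zip(row, col_min)] for s in reduced for row in s]


# Hypothetical 3 x 3 cost matrix split into two strips.
cost = [
    [4, 2, 8],
    [4, 3, 7],
    [3, 1, 6],
]
print(reduce_matrix(cost, [[0], [1, 2]]))
```

After both steps every row and every column of the result contains at least one zero, which is the starting point for locating independent zero elements.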

In the three previous parallel implementations, there is a problem with the
redundant assignment of weapons to the same target. In this implementation, the
problem is eliminated by coordinating the assignment process. Since each processor
contains a strip of the cost matrix, the assignments will be made by using only the
cost information from this strip. Two vectors, one containing the weapon number
assigned to each target and the other containing the target numbers that have been
assigned, will be used as the means of coordination. Each strip is further subdivided
into separate “windows.” The parallel assignment process will require a number of
iterations equal to the number of these “windows.” During a parallel assignment
iteration, each processor makes assignments on a different independent window. The
term independent means that the targets being considered for assignment are not
being considered by any other processor during the present iteration. The weapons
are already independent by virtue of the strip method of partitioning. After each
iteration, the individual assignment contributions from each processor are used to
update the two assignment vectors using a global concatenation function. Then the
vectors are broadcast to each processor and the next set of independent windows is
searched for possible assignments. After the last iteration of the parallel assignment,
the number of weapons that have been assigned is checked. If all weapons have been
assigned, then the algorithm terminates. If not, then the remainder of the program
operates exactly as the sequential version explained earlier in Section 4.2.2.
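
One way to guarantee that the windows examined in an iteration are independent is a simple rotation, sketched below in Python. The rotation rule is an assumption made for illustration; the text does not state how the thesis program orders the windows, and the function names are hypothetical.

```python
def window_schedule(num_procs):
    """Row k lists, for iteration k, the window index examined by each processor.

    Assumed rule: processor i examines window (i + k) mod num_procs.
    """
    return [
        [(proc + k) % num_procs for proc in range(num_procs)]
        for k in range(num_procs)
    ]


def is_independent(schedule):
    """True if no two processors share a window within any one iteration."""
    return all(len(set(row)) == len(row) for row in schedule)


schedule = window_schedule(3)
print(schedule)
print(is_independent(schedule))
```

With this rule each processor visits every window exactly once over the full set of iterations, and no target column is examined by two processors at the same time.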

The solution produced by this implementation will be the minimum-cost optimal
assignment. The final results will be the same as those produced by the
sequential version of the B&L algorithm described in Section 4.2.2. However,
development and initial testing of this implementation indicate that it will possibly
require as much or more time than the purely sequential version. The primary reason
for this is the volume and frequency of interprocessor communications used to
coordinate the assignments and eliminate the redundancies. Specific performance data
are presented in the following chapter and comparisons are made with the other
implementations.

In the nonsquare matrix case where the number of targets is greater than the
number of weapons, the number of operations required at first appears to be reduced
because of the multiple processors performing the operations in parallel. This holds
only when there is little or no coordination required. After the row minimums
have been subtracted, each of the assignment iterations on the windows described
earlier requires the transmission of node contributions to the serial processor, which
in turn broadcasts the updated vectors back to the parallel processors. Much of the
communications processing involved with the sending of messages between nodes
is performed by the operating system and the number of operations involved is
not easily determined. However, the sizes of the vectors are known, so a
rough estimate of the processing can be made. One vector length is equal to the
number of weapons and the other is equal to the number of targets. The number of
windows will be equal to the number of parallel processors utilized. At least $m + n$
operations will be required to combine the node contributions into a single vector
during each iteration, where m is the number of weapons and n is the number of
targets. If p is the number of parallel processors, then at least $p(n + m)$ additional
operations are required in the parallel implementation. The row minimum search
and subtraction will also require some additional operations to transmit the cost
information to the serial processor, but the actual number of operations is difficult
to determine because of the message passing. After considering these additional
factors, the complexity estimate of the fourth level implementation for the worst
case, where only one assignment is found in the parallel segment, is estimated to be
as follows:


$$\text{operations} = 2nm^2 + 4m^2 + mn\left(\frac{2}{p} + \frac{4}{p^2} - 1\right) + \frac{m}{p} + p(m + n + 1) - n - m \tag{4-5}$$

It is obvious from the complexity estimate that the parallel version of the B&L
algorithm will require more operations than the level 1 implementation in the worst
case. The actual performance, using data that is not worst-case, will be examined
in Chapter 5.
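
To compare the estimates numerically, Equations 4-1 through 4-5 can be transcribed directly into code. The following Python sketch does only that; the function names are hypothetical, and the symbols m, n, p, and q are used exactly as they appear in each equation (the text defines n slightly differently from one equation to the next, so the square case m = n is the safest one to compare).

```python
def ops_sequential(m, n):
    """Equation 4-1: sequential B&L estimate."""
    return 3 * n * m**2 + 4 * m**2 - 2 * n * m - m - 1


def ops_level1(m, n, p):
    """Level 1: Equation 4-1 divided by the number of processors p."""
    return ops_sequential(m, n) / p


def ops_level2(m, n, p):
    """Equation 4-3: level 1 plus the controller overhead of Equation 4-2."""
    return (ops_sequential(m, n) + n**2 + 3.5 * n) / p


def ops_level3(m, n, p, q):
    """Equation 4-4: level 2 plus the worst-case iteration overhead."""
    return ops_level2(m, n, p) + (n**3 + 3.5 * n**2) / (q * p**2)


def ops_level4(m, n, p):
    """Equation 4-5: worst-case estimate for the parallel B&L version."""
    return (2 * n * m**2 + 4 * m**2 + m * n * (2 / p + 4 / p**2 - 1)
            + m / p + p * (m + n + 1) - n - m)


m = n = 32  # illustrative square problem size
print(ops_sequential(m, n))
print(ops_level1(m, n, 4), ops_level2(m, n, 4), ops_level3(m, n, 4, 2))
print(ops_level4(m, n, 4))
```

For this illustrative case the level 4 estimate exceeds the level 1 estimate, in line with the observation above; other problem sizes and processor counts can be substituted in the same way.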

4.4  Summary

This chapter restated the assumptions given in Chapter 1 and provided more
background on the ballistic missile simulation program used as an aid in generating
input data for the programs developed. Two sequential programs were presented,
one which utilized a sorting method to order the assignment costs and the other
which used a modified version of the Hungarian method presented in Chapter 3
and Appendix B. Four parallel programs were described which involved different
levels of interprocessor communications. The first three used the B&L algorithm
code replicated in certain nodes and partitioned the cost matrix among the different
processors. The fourth parallel program attempted to perform certain operations of
the B&L algorithm in parallel. The computational complexity of each
implementation was estimated. In Chapter 5, regression analysis is used to determine
how well the plots of predicted and actual processing times match. The relative
performance of each implementation is compared in terms of speedup over single
processor implementations and the optimality of the assignments.

5.  Experimental Results and Performance Analysis

The  details  of  implementing  the  sequential  and  parallel  assignment  algorithms 
were  presented  in  the  previous  chapter.  Test  cases  were  devised  that  were  small 
enough to permit hand calculation of the assignment results. After these
implementations were tested with this test data to ensure the assignment results were
correct,  a  series  of  performance  runs  was  made  with  larger  cost  matrices  and  data 
was  collected.  This  chapter  presents  these  experimental  results  and  analyzes  them 
according  to  the  criteria  stated  in  Chapter  1.  The  specific  performance  criteria  are 
computation  times,  speedups,  interprocessor  communications,  load  balancing,  and 
machine-size  to  problem-size  relationships.  This  chapter  is  organized  into  three  ma¬ 
jor  sections.  The  first  section  defines  the  performance  criteria  and  the  method  of 
data  collection.  The  second  section  presents  the  performance  results  of  all  the  im¬ 
plementations  and  evaluates  the  predicted  complexity  estimates  made  in  Chapter  4. 
The  last  major  section  analyzes  and  compares  these  results  according  to  the  criteria 
defined  in  the  first  section.  This  chapter  ends  with  a  summary  of  the  experimental 
results  and  analyses. 

5.1  Testing  Approach 

In  the  engagement  of  defensive  weapons  against  a  full-scale,  global  missile 
attack,  the  defensive  weapons  will  most  likely  be  outnumbered  by  the  incoming 
missiles.  Several  estimates  of  the  ratio  between  defensive  weapons  and  incoming 
targets (referred to as the ratio of weapons-to-targets from here on) have been made
in the open literature [AdF85, AdW85, BoW85, DrF85]. Although predicted ratios
of weapons-to-targets vary, depending on the assumptions made and the method
of analysis, most estimates range from 1:1 to 1:10. Based on these estimates, cost
matrix sizes corresponding to weapon-to-target ratios of 1:1, 1:5, and 1:10 were
chosen for this study. The number of weapons was chosen to range from 32 to 128.
This range of weapons was selected in order to study the assignment problem on a
small  scale  and  does  not  represent  any  estimate  of  the  number  of  defensive  weapons 
that  may  be  actually  deployed. 


For  the  experimental  tests,  five  different  cost  matrices  for  each  matrix  size  were 
generated  using  the  program  that  was  described  in  Chapter  4.  Several  trial  runs 
were  made  with  each  implementation  to  test  the  variability  of  the  processing  times 
obtained  and  to  establish  the  number  of  test  runs  needed.  An  analysis  of  the  means 
and  the  variances  of  the  processing  times  was  performed  using  a  statistical  data 
analysis package known as SAS (a registered trademark of the SAS Institute, Cary,
N.C.) [CoS87]. Five sets of matrices for each size were chosen as the standard number
of  runs  because  the  mean  processing  times  for  the  same  number  of  processors  were  not 
found  to  be  significantly  different  from  each  other  within  a  95%  confidence  level.  The 
same  test  performed  on  the  mean  processing  times  obtained  using  different  numbers 
of  processors  with  the  same  suite  of  input  data  did  show  significant  differences,  as 
expected.  The  ANOVA  (Analysis  Of  the  VAriance)  procedure  of  SAS  showed  that 
modeling  the  processing  times  as  a  function  of  the  input  data  (the  different  cost 
matrices)  with  the  number  of  processors  held  constant  was  a  very  poor  model.  It 
had  a  probability  of  rejection  of  0.9942.  This  indicates  that  the  different  input  data 
sets  do  not  have  a  significant  effect  on  the  processing  times.  On  the  other  hand,  if 
the  same  input  data  was  used  for  different  numbers  of  processors  and  the  processing 
times  were  modeled  as  a  function  of  the  number  of  processors,  the  probability  of 
rejecting  this  model  was  less  than  0.0001.  This  means  there  is  a  better  than  99.99% 
chance  that  the  number  of  processors  used  has  an  effect  on  the  processing  times. 
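
The kind of comparison described above rests on the one-way analysis of variance F statistic, which can be sketched in a few lines of Python. This is a generic textbook computation, not the SAS procedure, which reports considerably more (including the probabilities quoted above). The function name and the timing values are hypothetical.

```python
def one_way_f(groups):
    """One-way ANOVA F statistic for a list of samples (one list per group)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical processing times grouped by number of processors.
times = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
print(one_way_f(times))
```

A large F value relative to the appropriate F distribution indicates that the grouping factor (here, the number of processors) accounts for a significant share of the variation in processing time.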

5.1.1  Performance  Criteria  As  stated  in  Chapter  1  and  repeated  in  the  in¬ 
troduction  to  this  chapter,  there  are  a  number  of  performance  measures  that  need 
to be analyzed and compared for each implementation. In this section, each of these
measures is briefly defined and any special considerations are explained.

The first performance measure is the computation or processing time. In
sequential processors, this performance index is relatively simple to measure. However,
in an MIMD multiple-processor machine such as the iPSC, there are many factors
that affect the ease with which actual processing times can be measured. The parallel
solutions to many problems involve three principal phases: start-up, computation,
and wind-down [WaL85]. In this research, the start-up and wind-down phases are
especially time consuming because of hardware constraints imposed by the Intel
iPSC. One major constraint is that there is only one serial data channel from the
cube  manager  to  the  node  processors.  This  limits  the  speed  of  transferring  initial 
cost  data  to  the  node  processors  and  receiving  the  results  from  the  node  proces¬ 
sors.  Improved  parallel  I/O  techniques  have  been  implemented  in,  for  example,  the 
NCUBE  hypercube  [HaM86]  and  the  PASM  prototype  [SiS84].  The  start-up  and 
wind-down  times  in  the  iPSC  implementation  unnecessarily  bias  the  runtimes.  As  a 
result,  they  will  not  be  included  in  the  total  processing  times  reported.  The  timing 
will  commence  when  all  processors  have  received  the  initial  cost  data  and  terminate 
when  the  last  processor  finishes  its  computations  and  is  ready  to  return  results. 

One common performance measure in parallel processing is speedup (S). This
index relates the time to compute a solution with one processor to the time to
compute a similar solution with N processors. It is defined as follows:

$$S = \frac{T_1}{T_N} \tag{5-1}$$

where

$T_1$ is the computation time for one processor and
$T_N$ is the computation time for N processors

If a problem can be broken down into N independent pieces, then N processors can
solve these N pieces in 1/Nth of the time required by a single processor. The T_1 times
reported in this research are those obtained from using one node processor of the
iPSC. Perfect speedup is N, but this is not normally achieved in practice. In some
instances, certain implementations achieve superlinear, or greater than N, speedup.
There are several factors that can account for this surprising result. In this research,
some of the speedups reported are superlinear. One reason for this is that the start-up
and the wind-down times are not included for the reasons discussed earlier. Another
reason is that although the processing times and assignment costs of the sequential
B&L algorithm are compared with those of the parallel versions, the algorithms
being compared are very different. The sequential B&L algorithm yields the optimal
overall assignment, whereas the parallel versions are heuristic methods designed to
produce acceptable results that are near optimal, but not exactly optimal.
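
As an illustration of Equation 5-1, the following Python sketch computes the speedup and checks whether it is superlinear. It is not part of the original iPSC implementation; the function names are hypothetical, and the efficiency measure (speedup divided by the number of processors) is added here only as a convenience. The sample times are the one- and two-processor entries for the 96 x 96 case in Table 5-2.

```python
def speedup(t_one, t_n):
    """Speedup S = T_1 / T_N as in Equation 5-1."""
    return t_one / t_n


def efficiency(t_one, t_n, n):
    """Speedup per processor; 1.0 corresponds to perfect speedup."""
    return speedup(t_one, t_n) / n


def is_superlinear(t_one, t_n, n):
    """True when the speedup exceeds the number of processors N."""
    return speedup(t_one, t_n) > n


if __name__ == "__main__":
    # 96 x 96 case, level 1: one processor versus two processors
    t1, t2 = 8.9020, 1.4335
    print(round(speedup(t1, t2), 2), is_superlinear(t1, t2, 2))
```

A speedup greater than N in this sense does not by itself show that the parallel program is better, because the parallel heuristics and the sequential algorithm do not return the same assignment.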

In  a  strict  interpretation  of  speedup,  the  results  of  two  different  configurations 
should  be  the  same.  However,  in  this  thesis,  the  term  speedup  will  be  used  as  one 
measure  of  performance  between  implementations  yielding  very  different  results.  For 
this  reason,  speedup  alone  is  not  sufficient  and  must  be  taken  in  conjunction  with 
other  measures  such  as  the  optimality  of  the  results  or  the  percent  effective. 

Interprocessor  communications  were  discussed  in  Chapter  3  as  one  of  the  more 
important  overheads  to  minimize  in  parallel  implementations.  In  the  results  that  will 
be  presented  shortly,  the  actual  time  spent  communicating  between  processors  will 
not  be  explicitly  shown.  The  method  that  will  be  used  to  assess  the  communications 
effect  will  be  to  compare  the  other  performance  measures  of  the  different  implemen¬ 
tations.  The  increasing  levels  (level  1  to  level  4)  of  implementation  correspond  to 
increasing levels of interprocessor communications. The criterion is straightforward:
if  higher  levels  of  implementations  perform  better,  then  higher  levels  of  communi¬ 
cations  are  better.  On  the  other  hand,  if  lower  levels  of  implementations  perform 
better,  then  lower  levels  of  communications  are  better.  Of  course  there  are  tradeoffs 
between  different  performance  characteristics.  Different  applications  may  require 
higher  performance  in  one  area  and  accept  poorer  performance  in  another  area. 
Issues  of  this  type  are  discussed  further  in  the  concluding  sections  of  this  chapter. 

Load  balancing  is  a  performance  measure  that  compares  the  processing  times 
of  the  individual  processors  in  a  parallel  system.  The  purpose  is  to  determine  if 
all processors are performing approximately the same amount of work or if several
processors remain idle while a few processors are performing a majority of the total
processing. Perfect load balance at first appears to be the ultimate objective, but if
the  load  balance  is  achieved  solely  by  excessive  communications  between  nodes,  then 
very  few  useful  computations  are  likely  being  performed.  In  later  sections,  specific 
times  are  not  presented,  but  representative  times  are  discussed  and  the  issue  of  load 
balancing  is  evaluated  for  each  parallel  implementation. 

The  machine-size  to  problem-size  relationship  or  scalability  is  an  important 
measure  that  shows  how  the  small-scale  experimental  results  can  be  applied  to  larger 
“real  world"  applications.  The  primary  means  of  evaluating  this  relationship  is  to 
first  use  regression  analyses  to  determine  the  models  that  best  fit  the  data  that  has 
been  collected.  Then  reasonable  estimates,  based  on  these  models  and  plots  of  the 
collected  data,  are  made  for  larger  problem  and  machine  sizes.  Because  of  the  nature 
of  some  problem  solution  times,  these  estimates  are  subject  to  some  error  and  should 
not  be  taken  as  absolute. 

5.1.2  Method  of  Data  Collection  As  explained  in  the  introduction  to  this 
section,  five  sets  of  matrices  were  generated  for  each  different  matrix  size.  Two 
sequential  implementations  and  four  parallel  implementations  were  tested.  The  data 
for the single-processor B&L algorithm is included in the level 1 data presentation.
From  this  point  on,  the  different  parallel  implementations  are  referred  to  as  level  1, 
level  2,  level  3,  or  level  4  corresponding  to  the  first  level,  second  level,  and  so  on 
implementations  described  in  Chapter  4.  The  sorting  method  implementation  is 
referenced  as  the  level  0  implementation. 

Because  five  runs  per  matrix  size  were  earlier  shown  to  be  statistically  ade¬ 
quate,  each  implementation  was  tested  with  the  same  set  of  five  matrices  so  that 
direct  comparisons  of  computation  times,  speedups,  and  communications  overhead 
can  be  made.  However,  some  of  the  performance  runs  for  the  level  4  implementation 


were not possible because of limitations in system buffers used to handle the internode
message traffic. This occurred when the cost matrix was large and more than
16 processors were being used. Complete data for the level 0 (sorting method)
implementation is also not presented because as the matrix sizes increased, the processing
times increased very rapidly.

In  some  instances,  the  0.005  second  resolution  of  the  timing  function  of  the 
iPSC affected the accuracy of the timing results. Cases where the error exceeds 10%
of  the  reported  mean  of  the  processing  times  are  marked  with  an  asterisk  (*)  and 
are  mainly  confined  to  the  32-weapon  cases  where  utilizing  more  than  16  processors 
resulted  in  processing  times  approaching  the  0.005  second  resolution.  All  derived 
speedups  associated  with  these  suspect  processing  times  are  also  marked  with  an 
asterisk  and  are  not  considered  to  be  accurate. 
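
The marking rule can be sketched as follows. This is a minimal illustration that assumes the timing error equals one tick of the 0.005 second timer divided by the reported mean; the source does not state exactly how the error was computed, and the function name is hypothetical.

```python
TIMER_RESOLUTION = 0.005  # seconds, resolution of the iPSC timing function


def is_suspect(mean_time, resolution=TIMER_RESOLUTION, threshold=0.10):
    """Flag a mean processing time whose assumed relative timing error
    (resolution / mean_time) exceeds the threshold."""
    return resolution / mean_time > threshold


if __name__ == "__main__":
    # Mean processing times in seconds
    for t in (0.0494, 0.1028, 0.2206):
        print(t, is_suspect(t))
```

Under this assumption a mean time is flagged once it drops below 0.05 seconds.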

5.2  Presentation  of  Results 

The  results  of  all  the  implementations  are  presented  in  this  section.  It  is 
organized  into  subsections  that  correspond  to  the  name  given  to  the  implementation. 
All of the 96-weapon data for the three ratios discussed earlier are given in tabular
form.  The  data  for  32,  64,  and  128  weapon  evaluations  are  included  in  Appendix  C. 

5.2.1  Level  0  The  level  0  or  sorting  method  program  was  developed  to  pro¬ 
vide  a  baseline  for  comparison  with  the  other  implementations.  However,  because 
of  system  load,  complete  data  for  all  the  matrix  sizes  was  not  obtained.  For  the 
largest  matrix  size  (128  x  1280),  the  processing  time  was  estimated  to  be  in  excess  of 
three hours per run. The average processing times and assignment costs that were
obtained  are  shown  in  Table  5-1.  The  Size  column  represents  the  product  of  the 
Weapons  and  Targets  columns  and  shows  the  number  of  elements  in  the  cost  list 
that  must  be  sorted.  The  entries  in  Table  5-1  are  sorted  according  to  the  number  of 
elements  in  the  cost  list.  The  Cost  column  shown  in  Table  5-1  represents  the  sums  of 
the  corresponding  values  from  the  cost  matrix  for  the  weapon-to-target  assignments 


list  grows,  the  sorting  time  grows  accordingly.  As  pointed  out  in  Chapter  4,  the 
time  required  to  sort  this  list  is  the  dominating  factor  of  this  implementation.  This 
explains  why  the  processing  time  (Time)  is  more  closely  related  to  the  size  of  the 
cost  list  than  to  the  weapon-to-target  ratio. 
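
To make the role of the sorted cost list concrete, the sketch below shows one plausible reading of a sorting-based assignment: all W x T cost entries are sorted, and pairs are then taken in order of increasing cost whenever both the weapon and the target are still unassigned. This is an assumption for illustration only; the actual level 0 procedure is the one described in Chapter 4, and the function name is hypothetical.

```python
def sorted_list_assignment(cost):
    """Greedy assignment from a sorted list of all cost-matrix entries.

    cost[w][t] is the cost of assigning weapon w to target t.
    Returns a dict mapping each assigned weapon to its target.
    """
    # Build and sort the full cost list (W * T elements); this sort is the
    # dominant step as the list grows.
    entries = sorted(
        (c, w, t) for w, row in enumerate(cost) for t, c in enumerate(row)
    )
    used_weapons, used_targets, pairs = set(), set(), {}
    for c, w, t in entries:
        if w not in used_weapons and t not in used_targets:
            pairs[w] = t
            used_weapons.add(w)
            used_targets.add(t)
    return pairs


if __name__ == "__main__":
    cost = [[4, 1, 3],
            [2, 0, 5],
            [3, 2, 2]]
    pairs = sorted_list_assignment(cost)
    total = sum(cost[w][t] for w, t in pairs.items())
    print(pairs, total)
```

In this small example the greedy result has a total cost of 6, while the optimal assignment (weapon 0 to target 1, weapon 1 to target 0, weapon 2 to target 2) costs 5, which shows why such a method serves only as a baseline.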

The  complexity  of  this  implementation  was  estimated  in  Chapter  4  to  be 
O(N^3.5). Performing a regression analysis on these processing times resulted in
the  model  shown  in  Equation  5-2. 

processing time = 3.54 x 10^-7 (W T^2) + 6.103    (5-2)

where 

W  is  the  number  of  weapons  and 
T  is  the  number  of  targets 

The adjusted R^2 coefficient produced by SAS in a regression analysis is a measure of
how well the predicted and actual times match, with 1.0 being a perfect match. The
adjusted R^2 coefficient between the predicted and actual processing times for this
model was 0.9725. For an equal number of targets and weapons, this corresponds
to O(N^3), so the estimate made in Chapter 4 was somewhat pessimistic. However,
the  estimate  was  based  on  the  number  of  operations  expected.  In  Equation  5-2,  the 
actual  processing  time  is  being  modeled.  There  are  many  operating  system  functions 
and  other  lower  level  instructions  being  executed  for  each  operation  estimated.  The 
relationship  between  high-level  operations  and  these  lower-level  operations  is  difficult 
to  determine.  However,  a  relationship  does  appear  to  exist.  The  processing  times 
and  the  estimated  number  of  operations  were  tested  for  correlation  and  the  Pearson 
correlation  coefficient  was  found  to  be  significant  to  better  than  a  95%  confidence 
level. 

5.2.2  Level  1  The  level  1  implementation  is  the  first  parallel  implementation 
where there are no communications between any of the processors in the hypercube.
The mean processing times and related speedups over both the single-processor B&L
algorithm (S_B&L) and the level 0 implementation (S_sort) for the 96-weapon cases are

shown  in  Table  5-2.  Similar  results  were  obtained  for  other  numbers  of  weapons 
and  are  included  in  Appendix  C.  In  Table  5-2,  the  Processors  column  contains  the 
number  of  processors  utilized  to  obtain  the  corresponding  mean  processing  times 
(Time)  reported.  The  number  of  processors  used  is  also  the  number  of  partitions 
made  on  the  input  cost  matrix. 

Table 5-2. Timing and Speedups of the Level 1 Implementation

Weapons  Targets  Processors  Time (sec)    S_B&L     S_sort
   96       96         1         8.9020      1.00       4.67
   96       96         2         1.4335      6.21      25.10
   96       96         4         0.5180     17.19      69.47
   96       96         8         0.2206     40.35     163.14
   96       96        16         0.1028     86.60     350.08
   96       96        32        *0.0494   *180.20    *728.50
   96      480         1         7.7080      1.00        -
   96      480         2         3.7935      2.03        -
   96      480         4         1.8853      4.09        -
   96      480         8         0.9464      8.14        -
   96      480        16         0.4772     16.15        -
   96      480        32         0.2422     31.82        -
   96      960         1        15.0570      1.00        -
   96      960         2         7.5195      2.00        -
   96      960         4         3.7635      4.00        -
   96      960         8         1.8891      7.97        -
   96      960        16         0.9518     15.82        -
   96      960        32         0.4836     31.13        -

The S_B&L speedups shown in Table 5-2 are all superlinear for the 96-weapon,
96-target cases. Some of the S_B&L speedups for the 1:5 ratio cases were slightly
better than perfect (perfect speedup = number of processors utilized), while the
1:10 ratio cases (96 x 960) were slightly less than perfect as more processors were
utilized. One reason why the speedups became less than perfect as the ratio of
weapons-to-targets increased is directly related to how the processing times behave.
In the B&L algorithm, when the cost matrix is square (i.e., the number of weapons
equals the number of targets), the initial solution calculated is nearly always not
optimal and must be reshuffled to obtain the optimal solution. However, as the
input cost matrix becomes more and more rectangular, the initial solution more
often than not is optimal and the reshuffling portion of the B&L algorithm is not
performed. This results in a time savings and the less than linear increase in the
B&L algorithm processing times as the number of weapons is held constant and the
number of targets is increased.

When  the  original  cost  matrix  is  divided  into  more  and  more  partitions,  the 
resulting  partitions  are  increasingly  more  rectangular.  This  results  in  faster  compu¬ 
tations  of  the  partition  assignments  because  the  reshuffling  portion  of  the  algorithm 
is  bypassed  and  results  in  superlinear  speedups  over  the  single-node  processing  time. 
This  is  mainly  true  for  the  case  when  the  original  cost  matrix  is  square.  When  the 
original  cost  matrix  is  rectangular,  then  the  previously  described  behavior  is  already 
in  effect  in  the  single-node  processing  times.  The  partitioning  still  produces  more 
rectangular  submatrices,  but  the  relative  reduction  in  processing  times  is  not  as  great 
and  results  in  the  more  expected  near-linear  speedups  shown  in  Table  5-2. 

The  speedups  over  the  sorting  method  Ssort  are  only  shown  for  the  1:1  ratio 
case  because  the  level  0  1:5  and  1:10  cases  for  96  weapons  were  not  run  as  previously 
explained. For the single-processor case, which is equivalent to a sequential B&L
algorithm,  the  level  1  processing  times  are  more  than  four  times  faster  than  the 
level  0  times.  When  multiple  processors  are  utilized,  the  speedup  Ssort  becomes  very 
large  and  illustrates  the  speed  advantage  of  the  parallel  level  1  implementation. 

A  regression  analysis  similar  to  the  one  explained  in  the  level  0  presentation 
was  performed  using  the  processing  times  shown  in  Table  5-2  and  Appendix  C.  The 
resulting  models  of  the  processing  times  were  of  the  same  order  as  the  complexity 
estimate  made  for  the  B&L  algorithm  in  Chapter  4.  The  models  differed  from  the 
estimate  in  the  coefficients  of  the  terms  and  some  of  the  lower  order  terms  were  not 
significant.  The  coefficients  are  different  because  of  the  previously  mentioned  rela¬ 
tionship  between  high-level  operations  and  lower-level  machine  instructions.  Also, 
the  estimate  was  based  on  an  assumed  worst-case  scenario,  while  the  data  used  in 
these performance runs was not worst-case. As an example, the model obtained for
the  eight-processor,  1:1  weapon-to-target  ratio  is  given  in  the  following  equation: 


processing time = -5.6283 x 10^-8 (W T^2) + 0.00002964 (T^2)    (5-3)

The adjusted R^2 coefficient for this model was 0.9924 and the probability of
rejection was less than 0.0001. Similar models were obtained for other numbers
of processors and weapon-to-target ratios. One concern is the negative sign of the
leading term. This indicates that the T^2 and WT^2 terms are interactive and to
some  extent  cancel  each  other  out.  Each  term  was  modeled  individually  and  yielded 
acceptable  models.  The  best  fit  was  obtained,  however,  when  both  terms  were  com¬ 
bined  into  a  single  model.  The  Pearson  correlation  coefficient  between  the  predicted 
number  of  operations  and  the  actual  processing  times  was  very  significant,  which 
indicates  that  there  is  some  relationship  between  the  two.  For  example,  the  Pearson 
coefficient  between  the  estimated  number  of  operations  and  the  1:1  weapon-to-target 
ratio  processing  times  was  0.98397  and  the  probability  of  rejection  was  0.0160. 
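
As a quick check, the eight-processor, 1:1 model of Equation 5-3 can be evaluated for W = T = 96 and compared with the corresponding mean time of 0.2206 seconds in Table 5-2. The sketch below reads the coefficients as -5.6283 x 10^-8 for the WT^2 term and 0.00002964 for the T^2 term; the function name is hypothetical.

```python
def level1_model_8proc(weapons, targets):
    """Equation 5-3: modeled processing time in seconds for the
    eight-processor, 1:1 weapon-to-target ratio level 1 runs."""
    wt2 = weapons * targets ** 2
    t2 = targets ** 2
    return -5.6283e-8 * wt2 + 0.00002964 * t2


if __name__ == "__main__":
    predicted = level1_model_8proc(96, 96)
    measured = 0.2206  # Table 5-2, 96 x 96, eight processors
    print(round(predicted, 4), round(abs(predicted - measured) / measured, 3))
```

The negative WT^2 term removes roughly 0.05 seconds from the 0.27 seconds contributed by the T^2 term, which shows the partial cancellation between the two terms noted above.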

The  assignment  costs  and  other  information  for  the  96-weapon  case  are  shown 
in  Table  5-3.  The  column  labeled  %  Effective  is  defined  the  same  as  in  level  0.  An 
additional  column  named  %  Wasted  contains  the  percentage  of  weapons  that  were 
redundantly  paired  with  a  previously  assigned  target.  These  weapons  were  therefore 
“wasted”  on  a  target  that  was  already  “killed”  by  another  weapon. 
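
The two percentages can be computed from a list of weapon-to-target pairings as sketched below. The sketch assumes that % Effective is simply 100 minus % Wasted, which agrees with the rows of Table 5-3 (for example, 72.3 + 27.7 = 100) but is not stated explicitly in this section; the function name is hypothetical.

```python
def assignment_percentages(target_of_weapon):
    """Return (% effective, % wasted) for a list whose i-th entry is the
    target assigned to weapon i.

    A weapon is counted as wasted when its target has already been
    assigned to an earlier weapon in the list.
    """
    seen_targets = set()
    wasted = 0
    for target in target_of_weapon:
        if target in seen_targets:
            wasted += 1
        else:
            seen_targets.add(target)
    pct_wasted = 100.0 * wasted / len(target_of_weapon)
    return 100.0 - pct_wasted, pct_wasted


if __name__ == "__main__":
    # Four weapons; weapons 1 and 2 both selected target 1
    print(assignment_percentages([0, 1, 1, 2]))
```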

One  trend  that  should  be  noted  in  Table  5-3  is  that  as  the  ratio  of  weapons-to- 
targets  increases,  the  %  Effective  also  increases  for  the  same  number  of  processors. 
This  is  a  result  of  fewer  redundant  weapon  allocations,  which  are  in  turn  a  result 
of  the  larger  number  of  possible  targets.  The  assignment  costs  are  also  lower  for 
the higher ratios of weapons-to-targets due to the wider choice of possible targets
which  may  be  engaged  by  each  weapon.  In  all  cases,  the  %  Effective  drops  as  more 
processors are used because none of the processors coordinate the assignments made
within each partition. In later implementations, different levels of coordination are
introduced in an attempt to reduce the redundant weapon allocations and increase
the % effective utilization.

Table 5-3. Assignment Results of the Level 1 Implementation

Weapons  Targets  Processors    Cost    % Effective  % Wasted
   96       96         1       1779.2      100.0        0.0
   96       96         2       1648.0       72.3       27.7
   96       96         4       1576.0       65.8       34.2
   96       96         8       1544.0       62.7       37.3
   96       96        16       1510.4       61.0       39.0
   96       96        32       1491.2       60.6       39.4
   96      480         1        756.8      100.0        0.0
   96      480         2        756.8       89.6       10.4
   96      480         4        755.2       84.8       15.2
   96      480         8        755.2       83.1       16.9
   96      480        16        752.0       83.6       16.4
   96      480        32        755.2       82.1       17.9
   96      960         1        723.2      100.0        0.0
   96      960         2        723.2       92.1        7.9
   96      960         4        723.2       87.7       12.3
   96      960         8        723.2       84.6       15.4
   96      960        16        723.2       83.3       16.7
   96      960        32        723.2       82.5       17.5


5.2.3  Level  2  The  level  2  implementation  introduces  a  small  amount  of  co¬ 
ordination  between  groups  of  processors  in  order  to  reduce  the  number  of  redundant 
assignments.  The  96-weapon  timing  and  speedup  results  are  shown  in  Table  5-4. 
The S_B&L and S_sort speedups shown in Table 5-4 were calculated in the same manner
described in the level 1 presentation. The Cntrl column refers to the number
of partitions or controller groups used in the configuration. The Proc/Cntrl column
refers to the number of processors per controller. The Tot Proc column contains
the total number of processors utilized in a particular configuration and is derived
by multiplying the number of controllers by the number of processors per controller
and then adding the number of controllers to the product. For example, a two-controller
configuration with four processors per controller will utilize (2 x 4) + 2 = 10
processors.
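
The Tot Proc calculation can be written as a one-line function. The sketch below is illustrative only and the function name is hypothetical; it reproduces the six configurations used in the level 2 and level 3 runs.

```python
def total_processors(controllers, procs_per_controller):
    """Total nodes used: the assign processors in every controller group
    plus one controller node per group."""
    return controllers * procs_per_controller + controllers


if __name__ == "__main__":
    for cntrl, per in [(2, 2), (2, 4), (2, 8), (4, 2), (4, 4), (8, 2)]:
        print(cntrl, per, total_processors(cntrl, per))
```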


Table 5-4. Timing and Speedups of the Level 2 Implementation

Weapons  Targets  Cntrl  Proc/Cntrl  Tot Proc  Time (sec)     S_B&L        S_sort
   96       96      2        2           6     [illegible]  [illegible]     54.61
   96       96      2        4          10     [illegible]  [illegible]     95.97
   96       96      2        8          18       0.239      [illegible]    150.58
   96       96      4        2          12       0.189      [illegible]    190.41
   96       96      4        4          20       0.149         59.74     [illegible]
   96       96      8        2          24      *0.046       *193.52     [illegible]
   96      480      2        2           6     [illegible]      3.89          -
   96      480      2        4          10     [illegible]      7.38          -
   96      480      2        8          18       0.606         12.72          -
   96      480      4        2          12       0.951          8.11          -
   96      480      4        4          20       0.523         14.74          -
   96      480      8        2          24       0.468         16.47          -
   96      960      2        2           6       4.615          3.26          -
   96      960      2        4          10       2.359          6.38          -
   96      960      2        8          18       1.228         12.26          -
   96      960      4        2          12       2.305          6.53          -
   96      960      4        4          20       1.176         12.80          -
   96      960      8        2          24       1.163         12.95          -

The  processing  times  and  speedups  shown  in  Table  5-4  are  divided  into  sec¬ 
tions  corresponding  to  the  1:1,  1:5,  and  1:10  weapon-to-target  ratios.  Each  of  these 
sections  can  be  further  subdivided  into  three  sub-sections  by  the  number  of  con¬ 
troller  groups  (Cntrl).  By  grouping  in  this  manner,  the  effects  of  adding  additional 
processors per controller can be seen. The processing times for the 1:10 and 1:5
weapon-to-target  ratios  decreased  in  proportion  to  the  number  of  additional  proces¬ 
sors  per  controller  group:  doubling  the  processors  per  controller  reduced  the  process¬ 
ing  times  by  approximately  half.  In  the  1:1  ratio  case,  the  reduction  in  processing 
times  was  not  as  evident.  This  is  related  to  the  processing  time  behavior  discussed 
in  the  level  1  presentation.  The  1:5  and  1:10  ratio  cases  provide  more  choices  for 
allocating  weapons  and  the  solution  is  obtained  quicker  due  to  the  highly  rectangular 
partitions.  Although  the  partitions  in  the  1:1  case  are  also  rectangular,  there  are 
fewer  targets  to  choose  from  and  computing  the  partition  solutions  is  more  likely  to 
require  iterations  of  the  reshuffling  portion  of  the  B&L  algorithm. 


As in the level 1 implementation results, the 1:1 ratio cases produced superlinear
speedups. However, the S_B&L values were not as large as those of level 1. This
is  due  to  the  increased  processing  times  which  are  a  result  of  the  additional  over¬ 
head  involved  with  the  coordination  within  controller  groups.  The  nodes  running 
the  controller  processes  and  the  assign  processes  are  basically  synchronous.  When 
the  controller  process  is  active,  then  the  assign  processes  are  idle  and  vice  versa. 
This  creates  a  situation  where  there  are  always  idle  node  processors  and  reduces  the 
speedups  obtainable.  The  situation  for  the  1:5  and  1:10  ratios  is  similar,  but  the 
time  required  by  the  controller  process  is  longer  because  of  the  factor  of  5  or  10 
increase  in  the  number  of  targets  that  must  be  coordinated.  This  causes  the  assign 
processes  to  remain  idle  longer  and  results  in  speedups  being  less  than  those  of  the 
1:1  cases. 

Although  the  coordination  of  redundant  pairings  does  comprise  a  portion  of  the 
processing  time,  the  regression  models  obtained  for  the  level  2  times  were  very  similar 
to  the  level  1  models.  This  indicates  that  the  coordination  does  not  completely 
dominate the processing time. An example of the type of model obtained is shown in
Equation  5-4  for  the  18-processor,  1:10  weapon-to-target  ratio  case. 


processing time = -1.95412 x 10^-9 (W T^2) + 2.9718 x 10^-6 (T^2) - 0.263511    (5-4)

The  terms  W  and  T  refer  to  the  number  of  weapons  and  targets,  respectively.  The 
Y-intercept  terms  were  found  to  be  significant  in  models  for  this  implementation, 
which  is  where  some  of  the  added  computations  estimated  for  the  level  2  model  are 
accounted for. The R^2 coefficient was at least 0.95 for all models, which indicates a
very  good  fit  between  the  actual  and  predicted  processing  times. 

The mean assignment costs, percent weapons effective, percent weapons wasted,
and an additional measure labeled % Idle are shown in Table 5-5. The % Idle measure
is unique to this implementation. It is a result of the coordination within each
controller group: instead of different processes possibly assigning multiple weapons
to the same target, the weapons associated with higher cost redundant assignments are placed in an
idle  state  for  future  use  rather  than  being  “wasted”  on  an  already- assigned  target. 
In  general,  this  implementation  idled  a  higher  percentage  of  the  weapons  when  the 
weapon-to-target  ratio  was  1:1.  This  results  from  the  coordination  action  where, 
instead  of  wasting  redundantly  allocated  weapons,  they  are  idled  for  future  use. 
More  redundancies  occur  in  the  1:1  case,  which  in  turn  yields  a  higher  percentage 
of  the  weapons  in  an  idle  state.  Some  weapons  are  still  wasted  because  there  is  no 
coordination  of  assignments  between  controller  groups.  For  the  1:10  ratio,  very  few 
weapons  were  wasted  and  less  than  20%  were  idled.  This  stems  from  fewer  redun¬ 
dancies  both  within  and  among  the  controller  groups.  The  greatest  advantage  of 
this  implementation  is  that  the  idled  weapons  are  available  for  future  assignments 
where  they  can  possibly  be  used  in  a  more  cost  effective  manner. 
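
The arbitration step performed by a controller can be sketched as follows. This is a simplified illustration, not the thesis code: it assumes the controller receives each proposed pairing as a (weapon, target, cost) tuple, and the function and variable names are hypothetical. For every target it keeps the lowest cost weapon and idles the rest.

```python
def arbitrate(proposals):
    """Resolve redundant pairings within one controller group.

    proposals: list of (weapon, target, cost) tuples from the assign processes.
    Returns (assigned, idle): assigned maps weapon -> target for the lowest
    cost weapon on each target; idle lists the weapons held for future use.
    """
    best = {}  # target -> (weapon, cost) of the cheapest proposal so far
    for weapon, target, cost in proposals:
        if target not in best or cost < best[target][1]:
            best[target] = (weapon, cost)
    assigned = {weapon: target for target, (weapon, _) in best.items()}
    idle = sorted(w for w, _, _ in proposals if w not in assigned)
    return assigned, idle


if __name__ == "__main__":
    # Weapons 0 and 1 both selected target 5; weapon 1 is cheaper
    proposals = [(0, 5, 3.0), (1, 5, 1.0), (2, 7, 2.0)]
    print(arbitrate(proposals))
```

Because each controller sees only its own group, two groups can still select the same target, which is why some weapons remain wasted in this implementation.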


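Table 5-5. Assignment Results of the Level 2 Implementation

Wpns  Tgts  Cntrl  Proc/Cntrl  Tot Proc    Cost    % Effective
 96    96     2        2           6      1328.0      65.8
 96    96     2        4          10      1180.8      62.7
 96    96     2        8          18      1115.2      61.0
 96    96     4        2          12      1372.8      62.7
 96    96     4        4          20      1281.6      61.0
 96    96     8        2          24      1436.8      61.0
 96   480     2        2           6       712.0      84.8
 96   480     2        4          10       696.0      83.1
 96   480     2        8          18       689.6      82.7
 96   480     4        2          12       736.0      83.1
 96   480     4        4          20       729.6      82.7
 96   480     8        2          24       748.8      82.7
 96   960     2        2           6       686.4      87.7
 96   960     2        4          10       662.4      84.6
 96   960     2        8          18       651.2      83.3
 96   960     4        2          12       699.2      84.6
 96   960     4        4          20       688.0      83.3
 96   960     8        2          24       712.0      83.3

[The % Wasted and % Idle columns are illegible in the source.]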


5.2.4  Level  3  The  level  3  implementation  introduces  more  coordination  be¬ 
tween  the  same  configuration  of  processors  found  in  the  level  2  program.  After  each 
iteration  of  the  B&L  algorithm  in  the  “assign”  processors,  the  controller  eliminates 
the redundant assignments and sends out vectors containing information to be used
in  the  computation  of  new  assignments. 


Table 5-6. Timing and Speedups of the Level 3 Implementation

Wpns  Tgts  Cntrl  Proc/Cntrl  Tot Proc  Time (sec)   S_B&L   S_sort
 96    96     2        2           6       0.9730      9.15    36.99
 96    96     2        4          10       1.8905      4.71    19.04
 96    96     2        8          18       3.5370      2.52    10.17
 96    96     4        2          12       2.1135      4.21    17.03
 96    96     4        4          20       2.5548      3.48    14.09
 96    96     8        2          24       1.4069      5.48    25.58
 96   480     2        2           6       2.4975      3.09      -
 96   480     2        4          10       4.3405      1.78      -
 96   480     2        8          18       6.0985      1.26      -
 96   480     4        2          12       4.0150      1.92      -
 96   480     4        4          20       4.6833      1.65      -
 96   480     8        2          24       2.6573      2.90      -
 96   960     2        2           6       4.8910      3.08      -
 96   960     2        4          10       8.4135      1.79      -
 96   960     2        8          18      11.6000      1.30      -
 96   960     4        2          12       7.6133      1.98      -
 96   960     4        4          20       9.0953      1.66      -
 96   960     8        2          24       5.2775      2.85      -

This  implementation  was  very  expensive  in  terms  of  processing  times  as  illus¬ 
trated  by  the  times  and  speedups  in  Table  5-6.  Except  for  the  first  entry  in  Table  5-6, 
none of the S_B&L speedups were better than perfect. The extra iterations of the B&L
algorithm,  combined  with  the  coordination  process,  substantially  increased  the  pro¬ 
cessing  times  and  thereby  reduced  the  speedups.  As  more  processors  were  added  to  a 
controller  partition,  the  processing  time  increased  rather  than  decreased.  The  reason 
the  processing  times  increased  so  dramatically  is  that  if  redundancies  remain  after 
an  iteration,  then  new  information  vectors  must  be  assembled  and  all  processors 
must  recompute  another  assignment  on  their  given  partition  based  on  the  updated 
information.  This  procedure  continues  until  all  redundancies  are  eliminated.  The 
increase  in  processing  times  is  an  especially  undesirable  effect  since  the  objective  is 
to  decrease  rather  than  increase  the  time  as  more  processors  are  utilized. 
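
The repeat-until-no-redundancy behavior can be illustrated with the sketch below. It is a deliberately simplified stand-in: each weapon greedily picks its cheapest remaining target instead of running the B&L algorithm on a partition, and the information vectors are reduced to a set of already-taken targets. All names are hypothetical, and the sketch assumes there are at least as many targets as weapons.

```python
def resolve_redundancies(cost):
    """Iterate until every weapon holds a distinct target.

    cost[w][t] is the cost of assigning weapon w to target t.
    Returns (assignment, rounds), where assignment maps weapon -> target.
    """
    n_targets = len(cost[0])
    taken = {}                        # target -> weapon that keeps it
    pending = list(range(len(cost)))  # weapons still needing a target
    rounds = 0
    while pending:
        rounds += 1
        # "Assign" step: each pending weapon proposes its cheapest free target.
        free = [t for t in range(n_targets) if t not in taken]
        proposals = {}
        for w in pending:
            t = min(free, key=lambda t: cost[w][t])
            proposals.setdefault(t, []).append(w)
        # "Controller" step: keep the cheapest weapon per target; the others
        # must recompute against the updated set of taken targets.
        pending = []
        for t, weapons in proposals.items():
            winner = min(weapons, key=lambda w: cost[w][t])
            taken[t] = winner
            pending.extend(w for w in weapons if w != winner)
    return {w: t for t, w in taken.items()}, rounds


if __name__ == "__main__":
    cost = [[1, 2, 3],
            [1, 3, 2],
            [2, 1, 3]]
    print(resolve_redundancies(cost))
```

In the example, weapon 1 loses target 0 in the first round and must settle for target 2 at a higher cost in a second round, which mirrors both the added processing time and the higher cost reassignments observed for this implementation.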

Regression  analyses  yielded  very  similar  models  for  this  implementation  when 
compared to the level 1 and level 2 models. One difference in the models obtained
for  this  implementation  is  that  the  Y-intercept  term  became  more  significant  and 
positive.  This  is  an  indication  that  there  are  increasing  overheads  involved  with 
the  coordination  process.  The  estimation  of  the  extra  computation  involved  with 
the  “controller”  process  made  in  Chapter  4  could  not  be  confirmed  because  the 
regression  models  were  computed  with  the  number  of  processors  held  constant.  The 
SAS  software  regarded  the  models  with  processors  as  a  variable  as  “not  of  full  rank.” 
This  means  the  results  obtained  would  be  misleading  and  biased  because  of  certain 
inter-relationships  between  the  different  terms  in  the  model.  The  model  shown  in 
Equation 5-5 is for the 1:1 ratio, 32-weapon, 20-processor case.

processing time = -5.924 x 10^-7 (W T^2) + 0.0000487 (T^2) + 0.267500    (5-5)

Assignment results for the level 3 program are shown in Table 5-7. The percent
weapons wasted decreased as the ratio of weapons-to-targets was increased.
This trend was also noted in the results for other numbers of weapons. The assignment
costs for level 3 were higher than those of any of the other implementations. This can
be at least partially explained by the method used to reassign weapons that were
redundantly allocated. When another target must be selected because of a
redundancy, its cost will be at least equal to, and probably higher than, that of the
originally selected target. The combination of several substantially higher cost reassignments
drives up the average cost dramatically, as shown by the cost data in Table 5-7.

5.2.5  Level  4  The  level  4  implementation  is  an  attempt  to  perform  several 
of  the  tasks  of  the  sequential  B&L  algorithm  in  parallel.  As  the  timing  and  speedup 
results  in  Table  5-8  illustrate,  the  effort  did  not  perform  as  well  as  one  would  hope. 
The  major  bottleneck  was  the  amount  of  interprocessor  communications  required 
to update global information used in the selection of potential weapon-to-target
pairings. 

Table 5-7. Assignment Results of the Level 3 Implementation

Wpns  Tgts  Cntrl  Proc/Cntrl  Tot Proc    Cost     % Effective  % Wasted
 96    96     2        2           6      15696.0      71.3        28.7
 96    96     2        4          10      18886.0      68.5        31.5
 96    96     2        8          18      17457.6      68.5        31.5
 96    96     4        2          12      13412.0      64.9        35.1
 96    96     4        4          20      16022.4      62.9        37.1
 96    96     8        2          24       7556.8      61.3        38.7
 96   480     2        2           6       9344.0      87.3        12.7
 96   480     2        4          10       8501.2      86.7        13.3
 96   480     2        8          18      10070.4      86.3        13.7
 96   480     4        2          12       4572.8      83.5        16.5
 96   480     4        4          20       6408.0      83.5        16.5
 96   480     8        2          24       2606.6      82.7        17.3
 96   960     2        2           6       6779.2      89.8        10.2
 96   960     2        4          10      11288.0      88.1        11.9
 96   960     2        8          18      11124.8      87.3        12.7
 96   960     4        2          12       4798.4      86.3        13.7
 96   960     4        4          20       7107.2      85.2        14.8
 96   960     8        2          24       2947.2      84.0        16.0

The  processing  times  for  this  implementation  were  only  marginally  better  than 
the  results  obtained  for  the  sequential  B&L  algorithm  (i.e.,  level  1,  one  processor).  In 
some  cases,  the  sequential  B&L  implementation  was  actually  faster  than  the  level  4 
implementation.  It  is  difficult  to  discern  the  exact  reason  for  the  poor  performance. 
One  problem  noted  during  the  gathering  of  the  results  was  that  the  default  number 
of  buffers  in  the  iPSC  used  to  handle  internode  message  traffic  was  too  small.  When 
additional  buffers  were  made  available,  then  the  amount  of  memory  remaining  was 
inadequate  for  processing  larger  cost  matrices.  This  definitely  had  an  effect  on  the 
speed  of  the  level  4  implementation. 

Another possible reason for the poor performance is that the parallel algorithm
used was too inefficient. One especially time-consuming task was the search for
independent zero elements. Recall that the cost matrix was partitioned into “windows”
that were searched independently and in a certain order by the “parallel” processors.
After each parallel processor searched one of its windows, all of the partial results
were transmitted to the serial processor for combination and transmittal to all
parallel processors for the next set of window searches. The coordination between
processors was necessary to ensure that no redundancies occurred. There may be
more efficient methods for performing this and other tasks, but the basic Hungarian
method, and the B&L algorithm in particular, may be intrinsically serial and not
parallelizable.

Table 5-8. Timing and Speedups of the Level 4 Implementation

Weapons  Targets  Processors         Time        S_B&L  S_Sort
     96       96           2        4.969         1.79    7.24
     96       96           4        4.563         1.95    7.89
     96       96           8        5.075         1.75    7.09
     96       96          16        6.616         1.35    5.44
     96       96          32       15.711         0.57    2.29
     96      480           2       12.373  [illegible]       -
     96      480           4        6.924  [illegible]       -
     96      480           8        6.851  [illegible]       -
     96      480          16  [illegible]  [illegible]       -
     96      480          32       32.695         0.24       -
     96      960           2       17.522  [illegible]       -
     96      960           4       12.876  [illegible]       -
     96      960           8       11.898         1.27       -
     96      960          16  [illegible]         0.84       -
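The window-by-window search for independent zeros can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions: the column-wise window layout, the greedy selection rule, and the names `search_window` and `windowed_zero_search` are hypothetical and are not taken from the level 4 program. The loop over windows stands in for the step in which the serial processor combines partial results before the next set of window searches.

```python
def search_window(matrix, cols, used_rows, used_cols):
    """Greedily pick zeros inside one window (a range of columns).

    Rows and columns already claimed in earlier rounds are skipped, so the
    returned picks never conflict with previously combined results.
    """
    rows_taken, cols_taken = set(used_rows), set(used_cols)
    picks = []
    for r, row in enumerate(matrix):
        if r in rows_taken:
            continue
        for c in cols:
            if row[c] == 0 and c not in cols_taken:
                picks.append((r, c))
                rows_taken.add(r)
                cols_taken.add(c)
                break
    return picks


def windowed_zero_search(matrix, n_windows):
    """Search the windows one after another, combining results in between."""
    n_cols = len(matrix[0])
    width = -(-n_cols // n_windows)  # ceiling division
    used_rows, used_cols, combined = set(), set(), []
    for w in range(n_windows):
        cols = range(w * width, min((w + 1) * width, n_cols))
        partial = search_window(matrix, cols, used_rows, used_cols)
        # Combination step: record the picks before the next window is searched.
        for r, c in partial:
            combined.append((r, c))
            used_rows.add(r)
            used_cols.add(c)
    return combined


# Hypothetical 3 x 3 reduced cost matrix (zeros mark candidate assignments).
reduced = [
    [0, 0, 3],
    [0, 2, 1],
    [4, 0, 0],
]
zeros = windowed_zero_search(reduced, n_windows=3)
print(zeros)
```

In this toy matrix the greedy pass selects two independent zeros even though a set of three exists, which hints at why a complete method needs further reshuffling after each search and why every round depends on the combined results of the previous one.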

The regression model obtained for this implementation was somewhat different
from the other models. This was expected because this program was so different
from the other programs. The processing times were affected by the hardware
limitations to the extent that multiple runs had to be made with different numbers of
system buffers and system memory allocations before the processes would complete
normally. The most reliable data obtained were for the cases where four and eight
processors were used, so these data were used as the basis for the regression analysis.
The model for the 8-processor, 32-weapon, 1:1 ratio case is given in Equation 5-6.


$$\text{processing time} = 0.0000475\,(WT^2) - 0.00893\,(T^2) + 0.55860\,(W) - 8.359 \qquad (5\text{-}6)$$


A term that was not significant in any of the other models is W. The reason for
this is that the vectors transmitted and the combination procedure performed by
the serial processor were all strongly related to the number of weapons.


A feature of the level 4 program is that the weapons are always 100% effective.
However, this is not an advantage since the time required to achieve these results
is longer than that of the single-processor level 1 program. The assignment results of
the level 4 implementation are shown in Table 5-9. Essentially, the assignment results
are identical to those of the level 1 implementation utilizing a single processor.


Table 5-9. Assignment Results of the Level 4 Implementation

Weapons  Targets  Processors      Cost  % Killed
     96       96           2    1779.2     100.0
     96       96           4    1779.2     100.0
     96       96           8    1779.2     100.0
     96       96          16    1779.2     100.0
     96      480           2     756.8     100.0
     96      480           4     756.8     100.0
     96      480           8     756.8     100.0
     96      480          16     756.8     100.0
     96      960           2     723.2     100.0
     96      960           4     723.2     100.0
     96      960           8     723.2     100.0
     96      960          16     723.2     100.0

5.3  Analysis  of  Results 

In this section, the performance results of the different implementations are
compared and evaluated. The relative advantages and disadvantages of each
implementation are also discussed. Graphs are used to illustrate trends and make
additional comparisons between the different implementations.

5.3.1 Computation Times and Speedup The computation times varied widely
between the different programs. The fastest times were those of the level 1 and level 2
programs, while the slower times were those of the level 3 and level 4 programs.
This can be mainly attributed to the volume and frequency of communications
between processors in the different implementations. As the level of coordination and
communications increased, the computation times also increased. The corresponding
speedups over the level 0 program and the single-processor level 1 program show
the reverse trend since Ts is a divisor in the calculation of speedup. The speedups
obtained by the four levels of implementation over the level 0 program for 96 x 96
cost matrices are illustrated in Figure 5-1.


Figure 5-1. Speedups over the Level 0 Implementation (1:1 Weapon-Target Ratio)

The  speedups  over  the  single-processor  level  1  implementation  are  shown  in 
Figures  5-2  and  5-3.  The  single-node  level  1  results  are  equivalent  to  a  sequential 
version  of  the  B&L  algorithm.  In  all  cases,  the  level  1  and  level  2  implementations 
exhibited  substantial  speedups  over  the  single-processor  programs.  The  superlinear 
speedups  in  the  1:1  ratio  cases  at  first  do  not  seem  possible.  The  explanation  for  why 
the  B&L  algorithm  works  faster  for  rectangular  matrices  than  for  square  matrices 
was  given  in  the  presentation  of  the  level  1  results.  Recall  from  that  discussion 
that  when  an  initially  square  cost  matrix  is  partitioned,  each  processor  receives  a
rectangular  submatrix.  For  rectangular  matrices,  it  is  more  likely  that  the  final 
assignment  solution  will  be  reached  in  the  formation  of  the  initial  solution  because 
of  the  increased  number  of  targets  from  which  assignments  can  be  made.  This  results 
in  bypassing  of  some  of  the  reshuffling  portions  of  the  B&L  algorithm  and  yields  a 
faster  solution  to  each  partition. 
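The shape change described above can be made concrete with a short sketch. It assumes that the partitioning is done by rows, so that each processor receives a subset of the weapons together with every target; this is inferred from the remark about the increased number of targets per partition, and the helper name `partition_rows` and the sample cost values are hypothetical.

```python
def partition_rows(cost, n_parts):
    """Split a cost matrix (list of rows) into n_parts contiguous row blocks."""
    base, extra = divmod(len(cost), n_parts)
    parts, start = [], 0
    for p in range(n_parts):
        # The first `extra` blocks take one additional row when the split is uneven.
        size = base + (1 if p < extra else 0)
        parts.append(cost[start:start + size])
        start += size
    return parts


# Hypothetical 4 x 4 cost matrix: rows are weapons, columns are targets.
cost = [[(3 * w + 5 * t) % 7 for t in range(4)] for w in range(4)]
parts = partition_rows(cost, 2)
shapes = [(len(p), len(p[0])) for p in parts]
print(shapes)  # each block has fewer weapons than targets
```

Each block is rectangular even though the original matrix is square, which is the situation in which the initial solution is more likely to be final.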

For the 1:5 and 1:10 weapon-to-target ratios, the speedups were not as great
because the previously discussed performance for rectangular matrices was already
in effect for the single-processor times. However, significant speedups were still
obtained. One drawback to all of these faster partition solutions is that when they are
combined into a final solution, they are no longer optimal because of redundancies,
weapons idled, and reassignments to other targets performed by the different
implementations. However, in most cases, the advantage in processing time allows many
sub-optimal assignments to be computed in the time required to compute only one
optimal solution.

The speedups obtained by the level 3 and level 4 programs were disappointing.
They point out how extensive communications between processors and multiple
iterations of the B&L algorithm severely affected the processing times. However,
even with the poorer performance when compared with the other parallel
implementations, the level 3 and level 4 programs did produce speedups over the sorting
method used as the baseline for comparison.

5.3.2 Interprocessor Communications The effects of increased interprocessor
communications are illustrated by the longer processing times of the level 3 and
level 4 implementations. The ratio of computations-to-communications becomes
very small as the level of communications is increased because much more time is
spent communicating than computing. The difference in processing times between
the level 1 and level 2 programs is not excessive because the coordination process
requires a relatively small amount of time. All other operations in the level 1 and
level 2 programs are essentially the same. In the level 3 program, the coordination
process involves higher interprocessor message traffic to control the extra iterations
required to eliminate the redundancies. The transmission of vectors designating
the new assignment instructions in level 3 is similar to the vectors transmitted in
level 4 for coordinating the search of the matrix partitions for assignments. The
processing times of the level 3 and level 4 implementations reflect the extra time
spent communicating instead of computing.

Figure 5-2. Speedups over the Single-Node Level 1 Implementation (1:1 Ratio)
[Plot of speedup over single-node level 1 (Y) against number of processors (X) for the
96 x 96 level 1 through level 4 implementations.]

The desirability of complete independence between processors can be seen
upon comparing the processing times of the level 1 and level 4 implementations.
The level 4 implementation is generally less than twice as fast as the single-processor
level 1 program. For a 96-weapon, 960-target problem, the level 4 implementation
using 8 processors solved the problem only slightly faster than the single-processor
level 1 implementation, whereas the level 1, 8-processor configuration solved the
same problem 6 times faster than level 4 using 8 processors. However, the resulting
assignments and costs were somewhat different: 84.6% of the weapons were effective
for level 1 vs. 100% effective for level 4. Except for the 100% effectiveness of the
allocated weapons for level 4, the added communications and iterations of the level 3
and level 4 implementations do not appear to provide any particular advantages.

Figure 5-3. Speedups over the Single-Node Level 1 Implementation (1:10 Ratio)

5.3.3 Problem Scalability The relationship between the size of the problem
and the size of the machine (number of processors used) is difficult to assess. In the
cases studied in this research, the optimum number of processors varied from one
implementation to another. For the level 1 implementation, the 1:1 ratio speedups
obtained tended to decrease as the number of weapons and targets increased. For
example, the speedup S_B&L for the 32-weapon, 32-target problem using 4 processors
was 24.38 while the 128-weapon, 128-target problem speedup using 4 processors was
only 18.07. The situation for the 1:5 and 1:10 ratios was completely different.
Increasing the number of weapons and targets while holding the weapon-to-target ratio
constant provided some interesting results: as the weapons and targets increased, the
speedups remained nearly constant. For example, the speedup S_B&L for the 32 x 320
cost matrix using 16 processors was 15.62, and the speedup for the 128 x 1280 cost
matrix for 16 processors was 15.92. These trends indicate that for weapon-to-target
ratios greater than or equal to 1:5, close to perfect speedups are possible even as
the problem is greatly enlarged. Some limit to the problem size probably exists, but
increasing the problem size by a factor of 16 and still obtaining roughly the same
speedup is a good indicator that much larger problems can be solved with reasonably
good speedups over the sequential processing times.
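One way to read "close to perfect" is to divide each quoted speedup by the number of processors used. The sketch below does this for the four speedups cited above. Parallel efficiency is a standard measure, but it is not one of the performance criteria used in this research, and the function name `efficiency` is only illustrative.

```python
def efficiency(speedup, n_processors):
    """Parallel efficiency: speedup divided by the number of processors."""
    return speedup / n_processors


# Speedups quoted in the text: (cost matrix, processors, speedup).
quoted = [
    ("32 x 32", 4, 24.38),
    ("128 x 128", 4, 18.07),
    ("32 x 320", 16, 15.62),
    ("128 x 1280", 16, 15.92),
]

for label, p, s in quoted:
    print(f"{label:>10}  p={p:<2}  speedup={s:5.2f}  efficiency={efficiency(s, p):.3f}")
```

Values above 1.0 for the 1:1 cases correspond to the superlinear speedups discussed earlier, while the 1:10 cases sit just below 1.0 at both problem sizes.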

Although the S_B&L speedups for the level 2 implementation were not as close
to linear as the level 1 results, there were similar trends in scalability. For the 1:5
and 1:10 ratios, the speedups remained fairly constant with a few showing some
slight increase as the problem size increased from 32 to 128 weapons. One difference
was in the 1:1 ratio cases where, instead of the speedups decreasing as they did in
the level 1 implementation, the speedups S_B&L and S_Sort also increased slightly as
the problem size increased. Based on these results, the level 2 implementation also
appears to be a good candidate for solving larger problem sizes.

For both the level 1 and level 2 implementations, at weapon-to-target ratios
greater than 1:1, an increase in the weapon-to-target ratio appears to decrease the
speedups obtained. In the plots in Figure 5-4, these speedups appear to be close to
linear, but there is a slight difference between the 1:5 and 1:10 plots. For the level 1
implementation, if a 128-processor machine were available, the speedups would approach
128, and so on for larger machines. For the level 2 implementation, the speedups
would not be as great as the level 1 speedups, but as the size of the machine
increased, there would be a corresponding improvement in the processing times and
speedups obtained. These are conjectures, but they are based on observations and
trends of the data collected. There is no way to predict precisely what the behavior
of the processing times would be for larger machines. However, for the range of
problem and machine sizes tested, it is reasonable to expect similar results for larger
machines and problems.


Figure  5-4.  Level  1  Speedups  over  the  Sequential  B&L  Algorithm  (96  Wpns) 

Up to this point, the discussion has focused on the level 1 and level 2
implementations because the programs are very similar and the speedups obtained were
the largest. For the other implementations, the speedups rapidly fell victim to the
communications overhead as the number of processors was increased. Except for the
level 1 and level 2 implementations, the processing times generally increased rather
than decreased as more processors were used. Because of the trends in processing
times observed in the level 3 and level 4 implementations, the extension to larger
problem sizes and correspondingly larger numbers of processors does not seem to be
feasible.

5.3.4 Cost and Effectiveness of Assignments The processing times and speedup
have been the main measures of performance emphasized until now. The manner in
which the available weapons are utilized is also very important. If an algorithm is
extremely fast but yields poor weapons utilization, it will not be very useful. The
assignment results of all the implementations were presented in the previous section.
For comparison, plots of weapon effectiveness for 1:1 and 1:10 weapon-to-target
ratios are shown in Figures 5-5 and 5-6.

One important trend to note between Figures 5-5 and 5-6 is that the percentage
of targets killed increases as the ratio goes from the less likely 1:1 (96 x 96) ratio
case to the more likely 1:10 (96 x 960) ratio. All of the implementations produced
kill percentages above 80% for the 1:10 and 1:5 weapon-to-target ratios. Except for
the 100% kill percentages for the relatively slow level 0 and level 4 implementations,
the best overall assignment performance was obtained with the level 2 program.
In almost all 1:5 and 1:10 ratio cases, it wasted less than 10% of the weapons. In
general, the level 2 program idled more weapons than it wasted while yielding kill
percentages comparable to the other implementations. The idling of weapons rather
than wasting them is important, especially when weapons are scarce. Idled weapons
can be withheld until a later assignment iteration when they may be utilized in a
more cost effective manner.
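The difference between idling and wasting a weapon can be sketched as a small arbitration step of the kind performed by a level 2 control node. This is a hedged illustration only: the rule of keeping the cheapest weapon for each contested target is an assumption, and the function name `arbitrate` and the sample data are hypothetical rather than taken from the level 2 program.

```python
def arbitrate(assignments):
    """Resolve redundant allocations by idling all but one weapon per target.

    assignments: list of (weapon, target, cost) tuples gathered from the
    partitions. Returns (kept, idled), where kept holds one tuple per target
    and idled lists the weapons withheld for a later assignment iteration.
    """
    best = {}
    for weapon, target, cost in assignments:
        # Assumed tie-break rule: keep the lowest-cost weapon for each target.
        if target not in best or cost < best[target][2]:
            best[target] = (weapon, target, cost)
    kept = sorted(best.values())
    kept_weapons = {weapon for weapon, _, _ in kept}
    idled = sorted(w for w, _, _ in assignments if w not in kept_weapons)
    return kept, idled


# Two partitions both chose target 5; weapon 0 is the more expensive choice.
proposed = [(0, 5, 3.0), (1, 5, 2.0), (2, 7, 4.0)]
kept, idled = arbitrate(proposed)
print(kept, idled)
```

In this sketch weapon 0 is held back rather than fired at a target that is already covered, so it remains available for a later iteration.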


Figure 5-5. Weapon Effectiveness vs. Number of Processors (1:1 Ratio)
[Plot of weapon percent effective (Y) against number of processors (X) for the
96 x 96 level 1 through level 4 implementations.]

The associated assignment costs are shown in Figures 5-7 and 5-8. The
assignment cost is a measure of how expensive the overall assignment will be in terms of
resources utilized. The highest costs were produced by the level 3 program. This
was explained earlier as a result of the assignment of certain weapons to higher cost
targets when redundancies occur. The purpose of the level 3 program was to utilize
as many of the weapons as possible to kill all possible targets. Situations may occur
in which this strategy is useful. However, a comparison with the results of the other
programs shows that level 2 killed approximately the same percentage of targets at a
generally lower cost and wasted fewer weapons. In addition, the level 2 processing
times were much faster than those of the level 3 program.


Figure 5-7. Assignment Cost vs. Number of Processors (1:1 Ratio)
[Plot of assignment cost (Y) against number of processors (X) for the 96 x 96
level 1 through level 4 implementations.]

5.4  Summary

This chapter has presented and analyzed the performance results of all the
programs developed in this research. It first explained the testing approach and
then defined the criteria used to measure the performance of the implementations.
The results for each program were presented and followed by an assessment of the
performance characteristics. Regression analyses provided some insight into how the
processing times behaved with the addition of coordination and communications.
The speedups and processing times of all implementations were compared and
analyzed. Also, the communications overhead, scalability, and effectiveness of the
assignments were evaluated. The level 2 program, which involved a modest amount
of coordination and communications, produced the best overall performance.


6.  Conclusions  and  Recommendations 




Before  the  conclusions  and  recommendations  of  this  research  are  presented,  a 
review  of  the  research  is  in  order.  Beginning  in  Chapter  1,  an  overview  of  parallel 
processing  as  it  relates  to  SDI  was  presented.  The  general  problem  of  assignment  was 
introduced  and  its  importance  to  the  BM/C3  system  emphasized.  The  objectives 
and  assumptions  were  stated  in  order  to  define  a  reasonable  scope  to  the  research. 
Chapter  2  presented  a  detailed  background  on  parallel  processing  encompassing  the 
architectures  of  parallel  processors,  the  hardware  organization  of  the  Intel  hypercube 
computer,  the  techniques  for  developing  parallel  software  implementations,  and  a 
survey  of  recent  parallel  implementations  developed  in  the  field. 

Chapter 3 defined the assignment problem and reviewed some of the important
sequential algorithms developed to solve the assignment problem. The transportation
and the Hungarian algorithms were chosen for comparison and evaluation. The
Hungarian method was chosen as the basis for the parallel implementations. The
divide and conquer strategy was chosen as the high-level parallel strategy for
combining the partial problem solutions into an overall solution. In Chapter 4, the
implementations of two sequential and four parallel assignment programs were
explained. The complexity of each implementation was estimated based on high-level
operations. Three of the parallel programs utilized the sequential B&L algorithm
and involved different types of partitioning and interprocessor communications. The
fourth parallel implementation was a parallelized version of the B&L algorithm.
Chapter 5 presented the experimental results and a performance analysis of each
of the implementations. The performance measures of computation time, speedup,
load balancing, and problem scalability were evaluated.

The remainder of this chapter will focus on the implications of this research
and form some conclusions. It will end with recommendations for applications of
this research and topics for further research in the area of parallel processing and
BM/C3.

6.1  Parallel  Processing:  Lessons  Learned 

The four parallel implementations completed in this research all served to
illustrate certain advantages and disadvantages of parallel processing. The first and
foremost disadvantage is that not all problems can be solved in parallel. In some cases,
the computational overheads and interprocessor communications overpower any
advantage gained by performing certain operations in parallel. This was illustrated by
the poor performance of the level 4 implementation where several operations were
attempted in parallel. The main problem with the level 4 implementation was the
method used to decompose the problem. The “windows” were used to allow multiple
processors to search for possible assignments and ensure that none of those
assignments were redundant. There are other methods for storing portions of matrices
in different processors where the data is more easily accessible. But an underlying
problem with the B&L algorithm in particular and the Hungarian method in general
is that a large number of its operations appear to be intrinsically serial in nature.
In the final analysis, the time penalty for parallelizing the operations of the B&L
algorithm was just too great. Much better performance was achieved with the level 2
implementation where minimal amounts of communications were used. In the level 2
implementation, a sequential algorithm was used to solve partitions of the overall
problem in parallel. A small amount of communications also proved to be better
than no communications at all. This was illustrated by the improved assignment
results and minimal time penalty of the level 2 program over the non-communicating
level 1 program.

The size of the problem partitions also plays an important role in how well
a parallel implementation performs. In this research, problem solutions utilizing
a larger number of small partitions produced noticeably better results than did a
small number of larger partitions. This appears to be, at least in part, a function of
the sequential algorithm used in the node processors. Other algorithms may or may
not yield the same results.

Another  problem  observed  is  that  the  balancing  of  the  computational  load 
between  the  processors  has  an  important  effect  on  the  performance.  The  load  balance 
of  the  level  1  and  level  2  programs  appeared  to  be  relatively  even.  The  problem  arose 
in  the  level  3  and  level  4  programs.  The  controller  processor  in  level  3  became  the 
bottleneck  to  completing  the  problem  solution.  After  the  assign  processors  completed 
one  iteration  of  the  assignment  algorithm,  they  remained  idle  until  the  controller 
processor  completed  a  serial  process  to  determine  if  further  processing  was  needed. 
While  the  assign  processors  computed  another  iteration,  the  controller  processors 
remained  idle. 

In  summary,  achieving  fast  and  efficient  parallel  processing  appears  to  rely  on 
three  fundamental  rules:  (1)  The  problem  must  be  partitionable  into  a  number  of 
independent  subproblems.  (2)  The  communications  between  the  processing  elements 
must  be  kept  to  a  minimum.  (3)  The  computations  performed  by  each  processor  must 
be  approximately  equal  and  simultaneous. 

6.2  Areas  of  Application 

The assignment problem solved in this research was very general. In Chapter 3,
the background information on assignment algorithms revealed that many types of
problems can be solved using the same basic techniques. Areas such as circuit board
routing, network flow analysis, and allocation of resources were cited. The
application of weapon-target assignment algorithms is certainly not limited to the missile
defense system proposed by the SDIO. Smaller-scale battle management systems
could also benefit from the application of parallel assignment algorithms to aid in
speeding up the decision processes. In addition, other functions of battle
management where fast processing of large amounts of data is necessary could certainly be
performed in parallel. The implementations could be realized using the development
techniques and guidelines presented in this research.

6.3  Recommendations  for  Further  Research 

The results of this research show that significant decreases in processing times
are possible by using multiple processors. The performance of the level 2
implementation illustrated that there needs to be a balance between communications
and computations. The Intel iPSC used in this research is a loosely-coupled
parallel processor machine. A shared-memory machine described in Chapter 2 was not
available for use when this research began, but one has recently been obtained by
the department. A natural extension would be to compare the results obtained in
this research with the results of assignment algorithms implemented on the
shared-memory machine. The reduction in interprocessor message-passing and the sharing
of assignment information between processors through the common memory could
prove interesting.

Different types of heuristics for reducing the redundant assignments could also
be a topic for further research. The elements of the cost matrix were random
values in a specified range. Time did not permit experimentation with the effects of
different groups of weapons that have similar opportunities for engaging the same
targets. The method of deriving the cost information could also be expanded and
improved. Instead of random numbers for cost values, further work with simulation
programs could be done to derive more representative values. The inclusion of
statistical probabilities of target kills based on specific weapons and targets is another
possibility.

In Chapter 4, the assignment process was assumed to be memoryless. The
assignment depended only on the present state of the system and the present
input. Expansion of the implementations to include consideration of past assignments
could yield improved results. In addition, methods to predict possible trends in
future assignments could also be beneficial, especially in situations where the weapon
resources are expected to be limited or very expensive.

In closing, this research has demonstrated that parallel processing provides
benefits and creates liabilities. Some of the benefits were demonstrated in the
computation times and speedups obtained with the implementations. But the results
were not completely optimal. This is, of course, just one of the liabilities. Each
application will possibly involve tradeoffs of one type or another. Further research
in the area of parallel algorithms and parallel software implementations can build
on the results presented in this thesis and yield further performance improvements.


Appendix  A.  The  Transportation  Method 


In  Chapter  3,  the  transportation  method  of  solving  the  assignment  problem 
was  briefly  described.  This  appendix  presents  the  steps  of  the  algorithm  in  detail 
and  points  out  the  similarities  between  the  transportation  method  and  the  simplex 
method  from  which  it  was  derived.  Following  the  algorithm  presentation,  an  example 
problem  is  given  that  illustrates  how  the  algorithm  operates.

There  are  two  phases  to  the  transportation  method.  The  first  phase  is  to 
formulate  the  initial  basic  feasible  solution.  The  second  phase  checks  the  initial 
solution  for  optimality  and  incrementally  improves  upon  it  until  it  is  optimal.  There 
have  been  several  methods  devised  to  provide  the  initial  solution,  but  one  simple 
approach  known  as  the  “northwest  corner  rule”  will  be  given  here  [Chu  57]:

1-1 Initialize the table by setting all $x_{ij}$ entries to null (no entry) and all $c_{ij}$
entries to the corresponding cost matrix values.

1-2 Beginning with the cell in the northwest corner of the table, assign the minimum
of $a_i$ or $b_j$, which correspond to the row availability (resources) and column
requirements (requesters) respectively, to the $x_{ij}$ variable. For the assignment
problem, these elements will always be one, so no decision needs to be made. Both $a_i$
and $b_j$ are reduced to zero and the $x_{ij}$ element is set to one.

1-3 Eliminate from further consideration the $i$th row and $j$th column containing
the $x_{ij}$ element just modified. This effectively reduces the dimension of the table. If
no rows and columns remain after this elimination, the initial solution is complete.
Otherwise, repeat Steps 1-2 and 1-3.
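Steps 1-1 through 1-3 can be sketched in a few lines. This is a minimal illustration and not the program used in this research: the function name `northwest_corner` and the dictionary representation of the table are assumptions, and the sketch accepts general availabilities and requirements so that the assignment problem appears as the special case in which every value is one.

```python
def northwest_corner(supply, demand):
    """Return {(i, j): amount} for an initial basic solution.

    supply[i] plays the role of a_i (row availability) and demand[j] the role
    of b_j (column requirement). The inputs are copied, not modified.
    """
    supply, demand = list(supply), list(demand)
    allocation = {}
    i = j = 0
    while i < len(supply) and j < len(demand):
        # Step 1-2: allocate the smaller of the remaining a_i and b_j.
        amount = min(supply[i], demand[j])
        allocation[(i, j)] = amount
        supply[i] -= amount
        demand[j] -= amount
        exhausted_row = supply[i] == 0
        exhausted_col = demand[j] == 0
        # Step 1-3: drop every exhausted row and column. In the assignment
        # problem both are exhausted at once, so the walk follows the diagonal.
        if exhausted_row:
            i += 1
        if exhausted_col:
            j += 1
    return allocation


n = 4
initial = northwest_corner([1] * n, [1] * n)
print(initial)
```

For the assignment problem the result is the northwest-to-southeast diagonal of the table, with exactly n assigned cells.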

When the initial solution is complete, the number of cells assigned will be
$n$ and they will form a northwest to southeast diagonal in the matrix. However,
one restriction of the transportation method is that the number of assigned cells in
an $n \times m$ cost matrix must equal $n + m - 1$. When the number of assigned cells
is less than $n + m - 1$, the solution is called degenerate. In the assignment
problem, the initial solution is always degenerate since only $n$ cells of the required
$n + n - 1$ are assigned. Additional artificial assignments need to be made to achieve
a nondegenerate solution in order for the second phase of the algorithm to work. A
method of generating these artificial assignments is described in the following steps:

1-4 Start with an unassigned cell and assign this cell a +0 designation.

1-5 A 0-path is a loop that begins and ends on a particular unassigned cell by
alternately assigning +0 and -0 designations to certain assigned cells in the loop.
This 0-path loop is formed by making one or more horizontal and vertical movements.
Except for the initial and final movements from and to the selected unassigned cell,
each movement must be from one assigned cell to another and form a segment with
assigned cells as endpoints by traversing one or more cells per movement. One
subtlety of forming the 0-path is that all assigned cells do not need to be included in
the path and some cells may be “skipped over” when forming the path. If a closed
loop can be formed in this fashion, the unassigned cell is termed dependent. The
objective is to identify all independent cells, which are those cells where a closed-loop
0-path cannot be formed. Once all independent unassigned cells are identified, then
a sufficient number of artificial $\epsilon$ allocations are made to these cells with the lowest
cost $c_{ij}$ to form the required $n + n - 1$ assignments. An $\epsilon$ allocation is defined as a
very small positive number which will be set to zero in the final solution to obtain
the actual allocation [Ign82].

Once the required number of ε allocations are made, then the steps of phase
II can be performed as follows:

2-1 Add an additional column and row to the table to contain row indicators $R_i$
and column indicators $K_j$.

2-2 Given a non-degenerate initial solution from phase I, assign a zero element to
any one of the $R_i$ or $K_j$ positions.


2-3 For each cell that has an actual or artificial value for the $x_{ij}$ entry, satisfy the
following expression:

$$\Delta_{ij} = R_i + K_j + c_{ij} = 0 \qquad \text{(A-1)}$$

The initial zero $R_i$ or $K_j$ element can be used to determine the missing $R_i$ or $K_j$
element. From this initial determination, all other values of $R_i$ and $K_j$ can be
determined [Ign82].

2-4 For the remaining unassigned cells, determine the values of $\Delta_{ij}$ using the
corresponding $R_i$, $K_j$, and $c_{ij}$ values. Enter these values of $\Delta_{ij}$ into the associated
cell in the upper right hand corner.

2-5 If all of the $\Delta_{ij}$ values are nonnegative for the unassigned cells, then the
assignment is optimal and the algorithm terminates. If any $\Delta_{ij}$ values are negative,
then the solution can be improved and step 2-6 must be performed.

2-6 Select the unassigned cell with the most negative $\Delta_{ij}$ value. In the case of a
tie in $\Delta_{ij}$ values, choose one of the most negative $\Delta_{ij}$ cells arbitrarily. This step is
analogous to the simplex method of selecting a non-basic variable to enter into the
basic solution set. The present assignment must be changed in order to include this
new variable, which requires that one of the present assigned cells (basic variables)
be removed. Go to Step 2-7.

2-7 Construct a 0-path as described in step 1-5, beginning with the cell having the
most negative $\Delta_{ij}$ value. However, this time the objective is to form a closed-loop
0-path.

2-8 The results of step 2-7 will yield some cells with +0 designations and others
with -0 designations. The cells with +0 designations will become assigned cells with
$x_{ij}$ values of one and all -0 cells become unassigned (no entry for $x_{ij}$). This step
is the same as the simplex method of selecting basic variables that are to leave the
basic solution set.

2-9 If the new assignment obtained in Step 2-8 is degenerate, perform steps 1-4
and 1-5 to add the required number of ε allocations to form a nondegenerate solution.
Then repeat steps 2-2 to 2-8 until an optimal solution is indicated in step 2-5.

As  an  example  of  the  transportation  algorithm  just  described,  consider  the 
following  cost  matrix: 

$$C = \begin{bmatrix} 7 & 4 & 3 & 8 \\ 5 & 5 & 4 & 9 \\ 2 & 7 & 9 & 2 \\ 10 & 3 & 1 & 6 \end{bmatrix} \qquad \text{(A-2)}$$


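Using the $c_{ij}$ values from this matrix, the initial transportation table can be
formed as shown in Table A-1. The null $x_{ij}$ values are indicated by a dash (-).

Table A-1. Initial Transportation Table
(each cell shows x_ij / c_ij)

resource \ requester |   1    |   2   |   3   |   4   | a_i
1                    | - / 7  | - / 4 | - / 3 | - / 8 |  1
2                    | - / 5  | - / 5 | - / 4 | - / 9 |  1
3                    | - / 2  | - / 7 | - / 9 | - / 2 |  1
4                    | - / 10 | - / 3 | - / 1 | - / 6 |  1
b_j                  |   1    |   1   |   1   |   1   |

The initial basic feasible solution resulting from steps 1-2 and 1-3 of the
transportation algorithm is shown in Table A-2. The $a_i$ column and the $b_j$ row will be
omitted in later representations since they will not be modified.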


Table  A-2.  Initial  Basic  Feasible  Solution 


One possible result of performing steps 1-4 and 1-5 is given in Table A-3. All
unassigned cells in Table A-2 are independent. Three additional assignments are
needed to form the required 4 + 4 - 1 = 7 assignments. The three independent
cells with the lowest costs were selected and given the ε allocations as shown in
Table A-3. One note of explanation about the choice of ε assignments is warranted.
The assignment of an ε to cell $x_{42}$ instead of $x_{34}$ was necessary because after cell $x_{43}$
was assigned, the 0-path for cell $x_{34}$ was no longer independent. This means that
the lowest cost unassigned cells are not always given the ε allocations.
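The independence test of steps 1-4 and 1-5 can also be checked mechanically. The sketch below does not trace 0-paths by hand; it swaps in an equivalent, well-known formulation: an unassigned cell $(i, j)$ admits a closed loop through assigned cells exactly when row $i$ and column $j$ are already connected through assigned cells, which a union-find structure tracks cheaply. The function name `choose_epsilon_cells`, the 0-based indexing, and the tie-breaking order (cost, then row, then column) are assumptions made here for illustration.

```python
def choose_epsilon_cells(cost, assigned):
    """Pick independent unassigned cells of lowest cost for epsilon allocations.

    cost     -- square cost matrix (list of lists), 0-based indices
    assigned -- set of (row, column) cells that already hold an assignment
    Returns the chosen cells in the order they were selected.
    """
    n = len(cost)
    parent = list(range(2 * n))   # nodes 0..n-1 are rows, n..2n-1 are columns

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving
            u = parent[u]
        return u

    def union(u, v):
        parent[find(u)] = find(v)

    for i, j in assigned:
        union(i, n + j)

    needed = (2 * n - 1) - len(assigned)    # n + n - 1 cells are required
    candidates = sorted(
        (cost[i][j], i, j)
        for i in range(n)
        for j in range(n)
        if (i, j) not in assigned
    )
    chosen = []
    for _, i, j in candidates:
        if len(chosen) == needed:
            break
        if find(i) != find(n + j):   # no closed loop exists: cell is independent
            union(i, n + j)
            chosen.append((i, j))
    return chosen


cost = [[7, 4, 3, 8], [5, 5, 4, 9], [2, 7, 9, 2], [10, 3, 1, 6]]
diagonal = {(0, 0), (1, 1), (2, 2), (3, 3)}
epsilon_cells = choose_epsilon_cells(cost, diagonal)
```

With the example cost matrix and the diagonal initial solution, the sketch selects the 0-based cells (3, 2), (2, 0), and (3, 1), that is $x_{43}$, $x_{31}$, and $x_{42}$, and it skips $x_{34}$ for the reason given in the text.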

Now  that  a  nondegenerate  basic  solution  has  been  obtained,  phase  II  of  the 
transportation  method  can  be  entered  to  determine  if  this  initial  solution  is  optimal. 
If  the  assignment  is  not  optimal,  then  it  will  be  modified  to  improve  it. 

As required by Step 2-1, the additional $R_i$ column and $K_j$ row are added to the
table. An initial zero element is arbitrarily assigned to the $R_4$ position by Step 2-2
and is shown in Table A-4. Any of the other positions could have been chosen for
this initial zero element.

The first iteration of step 2-3 is shown in Table A-5 where the $\Delta_{ij} = R_i + K_j + c_{ij} = 0$
expression is satisfied for the assigned cells in row 4. There are several iterations
required to formulate the remaining $R_i$ and $K_j$ elements. One possible sequence is
shown in Tables A-6 and A-7.


Table A-5. Calculation of Additional $R_i$ and $K_j$ Values

Table A-6. Additional $R_i$ and $K_j$ Values

Once all the $R_i$ and $K_j$ values are determined, the $\Delta_{ij}$ values can be calculated
for the unassigned cells and entered into the associated cells. Using the values of $R_i$
and $K_j$ from Table A-7, Step 2-4 yields the results shown in Table A-8.

Table A-7. Complete $R_i$ and $K_j$ Values
(each cell shows x_ij / c_ij)

resource \ requester |   1    |   2   |   3   |   4   | R_i
1                    | 1 / 7  | - / 4 | - / 3 | - / 8 | -13
2                    | - / 5  | 1 / 5 | - / 4 | - / 9 |  -2
3                    | ε / 2  | - / 7 | 1 / 9 | - / 2 |  -8
4                    | - / 10 | ε / 3 | ε / 1 | 1 / 6 |   0
K_j                  |   6    |  -3   |  -1   |  -6   |

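Table A-8. $\Delta_{ij}$ Values for Unassigned Cells
(assigned cells show x_ij / c_ij; unassigned cells show Δ_ij / c_ij)

resource \ requester |     1      |     2     |     3     |     4     | R_i
1                    | 1 / 7      | Δ=-12 / 4 | Δ=-11 / 3 | Δ=-11 / 8 | -13
2                    | Δ=+9 / 5   | 1 / 5     | Δ=+1 / 4  | Δ=+1 / 9  |  -2
3                    | ε / 2      | Δ=-4 / 7  | 1 / 9     | Δ=-12 / 2 |  -8
4                    | Δ=+16 / 10 | ε / 3     | ε / 1     | 1 / 6     |   0
K_j                  |     6      |    -3     |    -1     |    -6     |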


From Step 2-5, since not all of the $\Delta_{ij}$ values are nonnegative, the assignment
is not optimal and can be improved by performing Step 2-6. The most negative $\Delta_{ij}$
value in Table A-8 is -12, which is associated with the cells $x_{12}$ and $x_{34}$. Ties in the
negative values may be broken arbitrarily, so $x_{34}$ is chosen. Now the 0-path must be
constructed by using Step 2-7 so that the assignments may be shuffled to bring the
$x_{34}$ variable into the basic solution. The resulting 0-path is shown in Table A-9.
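Steps 2-2 through 2-6 amount to a small propagation followed by a table lookup, which the following Python sketch reproduces for the example. It is a sketch under stated assumptions rather than the report's implementation: the names `row_col_indicators` and `deltas`, the 0-based indexing, and the representation of the assigned (basic) cells as a set of index pairs are all introduced here for illustration.

```python
def row_col_indicators(cost, basic, start_row):
    """Solve R_i + K_j + c_ij = 0 over the basic cells (steps 2-2 and 2-3).

    basic     -- set of (row, column) cells holding a 1 or an epsilon
    start_row -- row whose indicator R_i is arbitrarily set to zero
    """
    n = len(cost)
    R = [None] * n
    K = [None] * n
    R[start_row] = 0
    changed = True
    while changed:                    # propagate until nothing new is found
        changed = False
        for i, j in basic:
            if R[i] is not None and K[j] is None:
                K[j] = -R[i] - cost[i][j]
                changed = True
            elif K[j] is not None and R[i] is None:
                R[i] = -K[j] - cost[i][j]
                changed = True
    return R, K


def deltas(cost, basic, R, K):
    """Delta_ij = R_i + K_j + c_ij for every unassigned cell (step 2-4)."""
    n = len(cost)
    return {
        (i, j): R[i] + K[j] + cost[i][j]
        for i in range(n)
        for j in range(n)
        if (i, j) not in basic
    }


cost = [[7, 4, 3, 8], [5, 5, 4, 9], [2, 7, 9, 2], [10, 3, 1, 6]]
# Diagonal assignments plus the epsilon cells x31, x42, x43 (0-based pairs).
basic = {(0, 0), (1, 1), (2, 2), (3, 3), (2, 0), (3, 1), (3, 2)}
R, K = row_col_indicators(cost, basic, start_row=3)
d = deltas(cost, basic, R, K)
most_negative = min(d.values())       # step 2-6 picks a cell with this value
```

Starting from $R_4 = 0$, the sketch yields the same $R_i$ and $K_j$ values as Table A-7, and the most negative $\Delta_{ij}$ is -12, shared by $x_{12}$ and $x_{34}$.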


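Table A-9. 0-Path for Exchange of Variables
(assigned cells show x_ij / c_ij; unassigned cells show Δ_ij / c_ij; [+0] and [-0] mark the cells on the 0-path)

resource \ requester |     1      |     2     |       3       |        4         | R_i
1                    | 1 / 7      | Δ=-12 / 4 | Δ=-11 / 3     | Δ=-11 / 8        | -13
2                    | Δ=+9 / 5   | 1 / 5     | Δ=+1 / 4      | Δ=+1 / 9         |  -2
3                    | ε / 2      | Δ=-4 / 7  | 1 / 9 [-0]    | Δ=-12 / 2 [+0]   |  -8
4                    | Δ=+16 / 10 | ε / 3     | ε / 1 [+0]    | 1 / 6 [-0]       |   0
K_j                  |     6      |    -3     |      -1       |       -6         |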

Now, the $x_{ij}$ values of the cells in the 0-path must be modified to form the new
assignment shown in Table A-10.

The new assignment is degenerate since there are only six assignments and
seven are required. Performing Steps 1-4 and 1-5 yields a possible nondegenerate
basic solution shown in Table A-11.

The presentation of the next iteration of phase II will be slightly abbreviated, but the
intermediate results of each step will be shown. One possible result of performing
Steps 2-2 and 2-3 is shown in Table A-12.

Performing Step 2-4 results in the following Table A-13. Since not all $\Delta_{ij}$ values in
Table A-14 are nonnegative, the solution can be improved upon further. The
most negative $\Delta_{ij}$ value in Table A-13 is -2 ($\Delta_{21}$). A 0-path beginning with this cell
is shown in Table A-14 and traverses the sequence of cells $x_{21}$, $x_{11}$, $x_{13}$, $x_{43}$, $x_{42}$,
$x_{22}$, and $x_{21}$ to form a closed loop.

Table A-12. Second Set of $R_i$ and $K_j$ Variables


The assignment resulting from reassigning the resources from the -0 cells to
the +0 cells is shown in Table A-15. Since there are only five assignments in this
new table, two additional ε allocations must be made using Steps 1-4 and 1-5. One
possible set of ε allocations is shown in Table A-16.


Table A-15. Second Iteration Assignment
(each cell shows x_ij / c_ij)

resource \ requester |   1    |   2   |   3   |   4
1                    | - / 7  | - / 4 | 1 / 3 | - / 8
2                    | 1 / 5  | - / 5 | - / 4 | - / 9
3                    | ε / 2  | - / 7 | - / 9 | 1 / 2
4                    | - / 10 | 1 / 3 | - / 1 | - / 6

Table A-16. Third Nondegenerate Basic Feasible Solution


After completing the ε allocations, the new $R_i$ and $K_j$ elements can be determined.
One possible arrangement is shown in Table A-17.


Table A-17. Third Set of $R_i$ and $K_j$ Variables


After determining all the $R_i$ and $K_j$, the $\Delta_{ij}$ values for the unassigned cells can
be calculated. The results in Table A-18 show that $\Delta_{12}$ and $\Delta_{22}$ are still negative,
which requires another shuffle of the assignment using the 0-path of Steps 2-6 and
2-7. Possible results of performing these steps are given in Table A-19.


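Table A-18. Third Set of $\Delta_{ij}$ Variables
(assigned cells show x_ij / c_ij; unassigned cells show Δ_ij / c_ij)

resource \ requester |     1     |    2     |    3     |    4     | R_i
1                    | Δ=+3 / 7  | Δ=-1 / 4 | 1 / 3    | Δ=+4 / 8 |  -2
2                    | 1 / 5     | Δ=-1 / 5 | ε / 4    | Δ=+4 / 9 |  -3
3                    | ε / 2     | Δ=+4 / 7 | Δ=+8 / 9 | 1 / 2    |   0
4                    | Δ=+8 / 10 | 1 / 3    | ε / 1    | Δ=+4 / 6 |   0
K_j                  |    -2     |    -3    |    -1    |    -2    |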


Reassigning the resources according to the 0-path constructed in Table A-19
and assigning new ε allocations results in the assignment shown in Table A-20.
Possible results of a third iteration of Steps 2-2, 2-3, and 2-4 are represented by
Table A-21.

Table A-19. Third 0-Path for the Most Negative $\Delta_{ij}$
(assigned cells show x_ij / c_ij; unassigned cells show Δ_ij / c_ij; [+0] and [-0] mark the cells on the 0-path)

resource \ requester |     1     |       2       |      3      |    4     | R_i
1                    | Δ=+3 / 7  | Δ=-1 / 4 [+0] | 1 / 3 [-0]  | Δ=+4 / 8 |  -2
2                    | 1 / 5     | Δ=-1 / 5      | ε / 4       | Δ=+4 / 9 |  -3
3                    | ε / 2     | Δ=+4 / 7      | Δ=+8 / 9    | 1 / 2    |   0
4                    | Δ=+8 / 10 | 1 / 3 [-0]    | ε / 1 [+0]  | Δ=+4 / 6 |   0
K_j                  |    -2     |      -3       |     -1      |    -2    |


Table A-20. Third Iteration Assignment
(each cell shows x_ij / c_ij)

resource \ requester |   1    |   2   |   3   |   4
1                    | - / 7  | 1 / 4 | ε / 3 | - / 8
2                    | 1 / 5  | - / 5 | ε / 4 | - / 9
3                    | ε / 2  | - / 7 | - / 9 | 1 / 2
4                    | - / 10 | - / 3 | 1 / 1 | - / 6

Upon inspecting the $\Delta_{ij}$ values in Table A-21, all are found to be nonnegative.
This means an optimal solution has finally been obtained. After setting the ε
allocations to zero, Table A-22 summarizes the optimal assignments.

Table A-21. Third Iteration $R_i$, $K_j$, and $\Delta_{ij}$ Values
(assigned cells show x_ij / c_ij; unassigned cells show Δ_ij / c_ij)

resource \ requester |     1     |    2     |    3     |    4     | R_i
1                    | Δ=+3 / 7  | 1 / 4    | ε / 3    | Δ=+4 / 8 |  +1
2                    | 1 / 5     | Δ=0 / 5  | ε / 4    | Δ=+4 / 9 |   0
3                    | ε / 2     | Δ=+5 / 7 | Δ=+8 / 9 | 1 / 2    |  +3
4                    | Δ=+8 / 10 | Δ=+1 / 3 | 1 / 1    | Δ=+4 / 6 |  +3
K_j                  |    -5     |    -5    |    -4    |    -5    |
Table A-22. Results of Transportation Method

Resource | Requester
1        | 2
2        | 1
3        | 4
4        | 3

Appendix B. The Hungarian Method


In Chapter 3, the general approach of the Hungarian method for solving the
assignment problem was described. In this appendix, a detailed explanation of those
steps will be presented. Then an example problem will be used to illustrate how the
algorithm operates.

In the presentation that follows, reference to the rating matrix refers to the
matrix described in Section 3.1.2. The specific steps of the Hungarian method are given
in the following:

1. Find the minimum element in each row of the rating matrix $A_0$ and subtract
that element from each element of that row. Next, find the smallest element in each
column and subtract that element from each element of that column. The resulting
matrix will now contain at least one null element in each row and column. This new,
modified matrix will simply be referred to as the matrix in later references.

2. Locate any row in the matrix that contains only one null element and suitably
mark the null element’s position. Cross out all other null elements in the column
that contains this marked position. Repeat this process until no more rows can be
found with only one null element that has not been marked or crossed out. If all rows
contain a marked position, then these positions constitute the optimum assignment.
The total cost of the assignment can be found by summing the individual costs of
the corresponding positions in the original matrix $A_0$. Otherwise, if not all rows
contain a marked position, then go to step 3.

3. Locate a column in the matrix from the previous steps that contains only
one null element. Mark this position and cross out all other null elements in the
row that contains this newly marked position. Repeat this process until no more
such columns can be found. If every column contains a marked element, then these
marked positions form the optimum assignment and the cost can be calculated as in
step 2. Otherwise, go to step 4.

4. Since an optimal solution has not yet been reached, more null elements must be
generated. First, the minimum set of lines that contain or cover all of the null
elements in the matrix must be constructed. By disregarding the crossed out elements
and retaining the marked elements from steps 2 and 3, the following procedure can
be used to draw this minimum set of lines:

4.1  Mark  the  rows  that  do  not  contain  any  marked  elements. 

4.2  Mark  the  columns  that  have  an  unmarked  null  element  in  a  marked  row. 

4.3  Mark  the  rows  that  have  a  marked  null  element  in  a  marked  column. 

4.4  Repeat  steps  4.2  and  4.3  until  no  more  rows  or  columns  can  be  marked. 

4.5 Draw lines through all unmarked rows and all marked columns.

5.  All  elements  with  lines  drawn  through  them  are  “covered”  and  those  without 
lines  through  them  are  “uncovered.”  Find  the  smallest  uncovered  element  in  the 
matrix  and  subtract  this  element  from  all  uncovered  elements  in  the  matrix.  Then 
add  this  smallest  element  to  all  covered  elements  that  are  located  at  the  intersections 
of  the  lines  drawn  in  step  4.5  to  form  a  new  matrix.  If  all  elements  of  the  matrix  are 
covered,  then  this  indicates  the  optimum  assignment  has  been  reached  and  exists  in 
the  set  of  null  elements  in  the  present  matrix. 

This step is a result of the König-Egerváry theorem on the minimum set of
covering lines [Kre68]. Its objective is to generate additional independent zero
elements to be covered by lines in later iterations of the algorithm. As defined in
Section 3.2.4, independent means that no other zero elements are present in the
same row and column. With N available resources, at least N independent zero
elements need to be included in the set of zero elements used to make the optimal
assignment. By performing Step 5, the cost of adding these additional zero elements
to the solution set is minimized. This step reduces all of the uncovered elements by
the same minimum uncovered amount, increases the elements covered twice by the
same amount, and does not change the elements covered only once. This procedure
is very similar to the simplex method's exchange of basic and nonbasic variables
explained in Sections 3.2.1, 3.2.2, and 3.3.1.

6.  Repeat  steps  2  through  5  until  the  optimum  assignment  is  found. 

As an example of the Hungarian method just presented, again consider the cost
matrix from Appendix A, which is repeated here for convenience:

7  4  3  8 
5  5  4  9 
2  7  9  2 
10  3  1  6 

The matrices referred to in the algorithm will be represented as tables in the
following presentation. Using the steps of the Hungarian algorithm, the following
tables illustrate the procedure. First, the minimum row elements of $A_0$ are identified
and subtracted to yield the following Table B-1. Then the minimum column elements
of Table B-1 are identified and subtracted to form Table B-2.
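Step 1 is simple enough to check directly. The following Python sketch (the function name `reduce_rows_then_columns` is an assumption made here, not a name from the report) subtracts the row minima and then the column minima of the example rating matrix, returning both intermediate matrices.

```python
def reduce_rows_then_columns(a0):
    """Step 1 of the Hungarian method (hypothetical helper).

    Returns (after_rows, after_columns): the matrix after subtracting each
    row's minimum, and the matrix after also subtracting each column's minimum.
    """
    after_rows = [[v - min(row) for v in row] for row in a0]
    col_min = [min(col) for col in zip(*after_rows)]
    after_columns = [
        [v - col_min[j] for j, v in enumerate(row)] for row in after_rows
    ]
    return after_rows, after_columns


a0 = [[7, 4, 3, 8], [5, 5, 4, 9], [2, 7, 9, 2], [10, 3, 1, 6]]
table_b1, table_b2 = reduce_rows_then_columns(a0)
```

For the example, the row minima are 3, 4, 2, and 1, and only the second column still needs a further subtraction (of 1), so every row and column of the second matrix contains at least one null element.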

Table B-1. Results of Subtracting Minimum Row Elements

requester \ resource | 1 | 2 | 3 | 4
1                    | 4 | 1 | 0 | 5
2                    | 1 | 1 | 0 | 5
3                    | 0 | 5 | 7 | 0
4                    | 9 | 2 | 0 | 5


Table B-2. Results of Subtracting Minimum Column Elements

requester \ resource | 1 | 2 | 3 | 4
1                    | 4 | 0 | 0 | 5
2                    | 1 | 0 | 0 | 5
3                    | 0 | 4 | 7 | 0
4                    | 9 | 1 | 0 | 5

In  steps  2  through  5,  a  box  will  be  used  to  mark  the  single  null  elements  in  the 
rows  or  columns  and  an  ‘x’  to  cross  out  null  elements.  Performing  the  procedure  of 
Step  2  yields  Table  B-3. 


Table B-3. Independent Null Row Elements
([0] = boxed null element, x = crossed-out null element)

requester \ resource | 1 |  2  |  3  | 4
1                    | 4 | [0] |  x  | 5
2                    | 1 |  x  |  x  | 5
3                    | 0 |  4  |  7  | 0
4                    | 9 |  1  | [0] | 5

Rows  2  and  3  do  not  contain  a  boxed  element,  so  the  optimum  assignment  has 
not  been  reached.  Step  3  must  now  be  performed  and  one  possible  result  is  shown 
in  Table  B-4. 

One note of explanation is needed about Table B-4. The null element boxed in row 3,
column 1 was not the only choice. The null element in row 3, column 4 could have
been  boxed  and  the  null  element  in  row  3,  column  1  crossed  out.  Recall  that  the  set 
of  null  elements  contains  at  least  one  optimal  assignment  and  possibly  more  than 
one. 

Checking the columns containing boxed elements in Table B-4 shows that column 4
does not contain a boxed element, so an optimum assignment has not been
reached. This requires the generation of more null elements, so Step 4 must be
taken. Only one row does not contain a boxed element, and Step 4.1 yields Table B-5.
Checking for columns with null elements in a marked row as required in Step 4.2
results in Table B-6.

Table B-4. Independent Null Row and Column Elements
([0] = boxed null element, x = crossed-out null element)

requester \ resource |  1  |  2  |  3  | 4
1                    |  4  | [0] |  x  | 5
2                    |  1  |  x  |  x  | 5
3                    | [0] |  4  |  7  | x
4                    |  9  |  1  | [0] | 5


Table B-5. Checked Rows Without Boxed Null Elements
([0] = boxed null element)

requester \ resource |  1  |  2  |  3  | 4 | Row Checks
1                    |  4  | [0] |  0  | 5 |
2                    |  1  |  0  |  0  | 5 | ✓
3                    | [0] |  4  |  7  | 0 |
4                    |  9  |  1  | [0] | 5 |
Column Checks        |     |     |     |   |

Now, using Step 4.3 requires rows 1 and 4 to be marked since they contain
a boxed null element in a marked column. The results are shown in Table B-7.
Rechecking Steps 4.2 and 4.3, as required by Step 4.4, reveals that no other rows or
columns can be marked. Now Step 4.5 can be followed, which calls for lines to be
drawn through all unmarked rows and all marked columns to form the set of covering
lines. The covering lines are illustrated by the asterisks at either end of a row or
column in the following Table B-8.


Table B-6. Checked Columns with Null Elements in Checked Rows
([0] = boxed null element)

requester \ resource |  1  |  2  |  3  | 4 | Row Checks
1                    |  4  | [0] |  0  | 5 |
2                    |  1  |  0  |  0  | 5 | ✓
3                    | [0] |  4  |  7  | 0 |
4                    |  9  |  1  | [0] | 5 |
Column Checks        |     |  ✓  |  ✓  |   |

Table B-7. Checked Rows with Boxed Null Elements in Checked Columns
([0] = boxed null element)

requester \ resource |  1  |  2  |  3  | 4 | Row Checks
1                    |  4  | [0] |  0  | 5 | ✓
2                    |  1  |  0  |  0  | 5 | ✓
3                    | [0] |  4  |  7  | 0 |
4                    |  9  |  1  | [0] | 5 | ✓
Column Checks        |     |  ✓  |  ✓  |   |

Table  B-8.  A  Minimum  Set  of  Covering  Lines 


Step  5  requires  that  the  minimum  uncovered  element  be  subtracted  from  each 
uncovered  element  and  added  to  the  elements  that  lie  at  the  intersections  of  the 
covering  lines.  From  the  previous  diagram,  the  minimum  uncovered  element  is  1. 


Subtracting  this  from  the  proper  elements  results  in  Table  B-9. 


Table B-9. New Table From Step 5

requester \ resource | 1 | 2 | 3 | 4
1                    | 3 | 0 | 0 | 4
2                    | 0 | 0 | 0 | 4
3                    | 0 | 5 | 8 | 0
4                    | 8 | 1 | 0 | 4
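Steps 4.1 through 4.5 and step 5 can be traced in code for this example. The sketch below is illustrative only: the names `covering_lines` and `adjust`, the 0-based indices, and the representation of the boxed elements as a set of index pairs are assumptions introduced here. Following the worked example, an "unmarked null element" in step 4.2 is taken to be any null element that is not boxed.

```python
def covering_lines(matrix, boxed):
    """Steps 4.1-4.5: return (line_rows, line_cols) covering every null element."""
    n = len(matrix)
    boxed = set(boxed)
    # 4.1: mark the rows that contain no boxed element.
    marked_rows = {
        i for i in range(n) if all((i, j) not in boxed for j in range(n))
    }
    marked_cols = set()
    changed = True
    while changed:                                   # 4.4: repeat until stable
        changed = False
        for i in sorted(marked_rows):                # 4.2
            for j in range(n):
                if (matrix[i][j] == 0 and (i, j) not in boxed
                        and j not in marked_cols):
                    marked_cols.add(j)
                    changed = True
        for i, j in boxed:                           # 4.3
            if j in marked_cols and i not in marked_rows:
                marked_rows.add(i)
                changed = True
    # 4.5: lines go through unmarked rows and marked columns.
    return set(range(n)) - marked_rows, marked_cols


def adjust(matrix, line_rows, line_cols):
    """Step 5: subtract the smallest uncovered element from uncovered cells
    and add it at line intersections. Assumes at least one uncovered cell."""
    n = len(matrix)
    smallest = min(
        matrix[i][j]
        for i in range(n) for j in range(n)
        if i not in line_rows and j not in line_cols
    )
    result = [row[:] for row in matrix]
    for i in range(n):
        for j in range(n):
            covers = (i in line_rows) + (j in line_cols)
            if covers == 0:
                result[i][j] -= smallest
            elif covers == 2:
                result[i][j] += smallest
    return result, smallest


table_b2 = [[4, 0, 0, 5], [1, 0, 0, 5], [0, 4, 7, 0], [9, 1, 0, 5]]
boxed = {(0, 1), (2, 0), (3, 2)}       # boxed elements of Table B-4, 0-based
line_rows, line_cols = covering_lines(table_b2, boxed)
table_b9, smallest = adjust(table_b2, line_rows, line_cols)
```

For the example this draws three lines (row 3 and columns 2 and 3 in the 1-based numbering of the tables), finds 1 as the smallest uncovered element, and reproduces Table B-9.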

Performing step 2 again yields Table B-10. Now there is a boxed element in
each row of Table B-10, so the optimum assignment has been reached. In this case,
the resulting assignments are shown in Table B-11. Note that this table is identical
to Table A-22 obtained in the transportation method example in Appendix A.


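Table B-10. Independent Null Row and Column Elements
([0] = boxed null element, x = crossed-out null element)

requester \ resource |  1  |  2  |  3  |  4
1                    |  3  | [0] |  x  |  4
2                    | [0] |  x  |  x  |  4
3                    |  x  |  5  |  8  | [0]
4                    |  8  |  1  | [0] |  4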

The cost of this assignment shown in Table B-11 can be obtained by summing
the corresponding costs in the original rating matrix $A_0$, which gives

4 + 5 + 2 + 1 = 12

The  sum  of  all  the  minimum  elements  found  in  steps  1  and  5  should  equal  this 
assignment  cost  since  these  minimum  elements  represent  the  costs  associated  with 
each  intermediate  assignment.  Summing  these  values  yields 

3 + 4 + 2 + 1 + 1 + 1 = 12
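For a matrix this small, the optimal cost can be confirmed independently by enumerating all 24 possible assignments. The sketch below is only a sanity check under that assumption of small size (the function name `brute_force_assignment` is introduced here); it is not a substitute for the Hungarian method, since exhaustive enumeration grows factorially with the matrix dimension.

```python
from itertools import permutations


def brute_force_assignment(cost):
    """Return (assignment, total) with the minimum total cost.

    assignment[i] is the 0-based column paired with row i.
    """
    n = len(cost)

    def total(perm):
        return sum(cost[i][perm[i]] for i in range(n))

    best = min(permutations(range(n)), key=total)
    return best, total(best)


a0 = [[7, 4, 3, 8], [5, 5, 4, 9], [2, 7, 9, 2], [10, 3, 1, 6]]
assignment, optimal_cost = brute_force_assignment(a0)
```

The enumeration returns the pairing (1→2, 2→1, 3→4, 4→3) in the 1-based numbering of the tables, with a total cost of 12.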

The cost results of the Hungarian method are also identical to the transportation
example, as expected. The example just presented was a minimization of the
assignment cost. It could have been transformed into a maximization of the assignment
cost by modifying the original rating matrix as follows:

0.1 Find the maximum element in the rating matrix. Create a new matrix
$C_0 = \|c_{ij}\|$ by individually subtracting each element in the cost matrix from the
value of the maximum element and store the difference in the corresponding location
in $C_0$.

$$c_{ij} = \max(a_{ij}) - a_{ij} \qquad \text{(B-7)}$$
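Equation (B-7) is a one-line transformation. A minimal Python sketch follows; the function name `to_minimization` is an assumption introduced here, and the example matrix is simply the rating matrix used throughout this appendix.

```python
def to_minimization(rating):
    """Apply c_ij = max(a_ij) - a_ij to every element of the rating matrix."""
    largest = max(max(row) for row in rating)
    return [[largest - v for v in row] for row in rating]


a0 = [[7, 4, 3, 8], [5, 5, 4, 9], [2, 7, 9, 2], [10, 3, 1, 6]]
c0 = to_minimization(a0)
```

Minimizing over the transformed matrix then selects the assignment that maximizes the total of the original ratings, because every complete assignment is offset by the same constant (the maximum element times the matrix dimension).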


One other variation would be the case where the number of requesters and
resources were not equal. In this case, the rating matrix would not be square as it
must be for the original Hungarian algorithm to work. This can be taken care of
by adding “dummy” resources or requesters with rating values of zero [Ign82]. This
completes the example and discussion of the Hungarian method.
Appendix C. Additional Results

This appendix lists all of the remaining results of the implementations
described in this report. All entries marked with a * are more than 10% in error due
to the accuracy of the timing function of the iPSC.

Table  C-l.  Timing  and  Speedups  of  the  Level  1  Implementation  (32  Wpns) 


Processors  Time  (sec) 
FI  1.2750“ 
2  0.2210 

4  0.0523 

8  *0.0245 

16  *0.0116 

32  *0.0061 

1  0.9100 

2  0.4320 

4  0.2138 

8  0.1070 

16  0.0551 

32  *0.0285 

1  1.7090 

2  0.84C0 

4  0.4215 

8  0.2135 

16  0.1094 

32  0.0572 


Sbll 

Too" 

5.77 

24.38 

*52.04 

*109.91 

*209.02 

1.00 

2.11 

4.26 

8.50 

16.52 

*31.93 

1.00 

2.03 

4.05 

8.00 

15.62 

29.88 


1.84 

10.64 

44.97 

*96.00 

*202.76 

*385.57 

23.84 

50.22 

101.46 

202.74 

393.70 
*761.16 

29.88 
60.78 
121  13 
239.14 

466.70 
892.60 


, 


Table  C-2.  Timing  and 
|  Weap  |  Targ 
I  6?~|  64~ 


64  320 

64  320 

64  320 

64  320 

64  320 

64  320 

64  640 

64  640 

64  640 

64  640 

64  640 

64  640 


Speedups  of  the  Level 
Processors  Time  (sec) 
T]  4.1830 

2  0.6450 

4  0.2143 

8  0.0959 

16  *0.0447 

32  *0.0224 

1  3.5310 

2  1.6940 

4  0.8408 

8  0.4225 

16  0.2139 

32  0.1095 

1  6.7820 

2  3.3800 

4  1.6943 

8  0.8515 

16  0.4308 

32  0.2200 


1  Implementation  (64  Wpns) 


1.00 

6.49 

19.52 

43.62 

*93.58 

*186.74 

1.00 

2.08 

4.20 

8.36 

16.51 

32.25 

1.00 

2.01 

4.00 

7.96 

15.74 

30.83 


3.96 
25  71 
77.37 
172  90 
*370.94 
*740.22 
37.88 
77.93 
157.01 
312.46 
617.19 
1205.62
Table C-5. Timing and Speedups of the Level 2 Implementation (64 Wpns)


Weap 

Targ 

64 

64 

64 

64 

64 

64 

64 

64 

64  . 

64  320 

64  320 

64  320 


64 

640 

64 

640 

64 

640 

table  C-6.  Timing  and  Speedups  of  the  Level  2  Implementation 


C'ntrl  1  Proc/Cntrl 


Tot  Proc 

Time  (sec) 

6 

1.009 

10 

0.527 

18 

0.359 

12 

0.340 

20 

0.244 

24 

0  132 

6 

3.668 

10 

1.903 

18 

1  033 

12 

1812 

20 

0.932 

24 

0.902 

6 

8.378 

10 

4  283 

18 

2  237 

12 

4.195 

20 

2  145 

24 

2  122 

14  43 
27.62 
40.54 
42.81 
59.65 
110.27 


128  Wpm 

•8  So  rl 
100.11 
191.67 
281.37 
297.09 
413.98 


VNeap  |  Targ  |  Cntrl 


128  128 
128  128 
128  128 
128  128 


28 
28 
128 
128 
128 
128  I  G40 


28  1280 
28  1280 
128  1280 
128  1280 
128  1280 
128  1280 


Proc/Cntrl  I  Tot  Proc 
6 
10 
18 
12 
20 
24 


vel  3  Implementation  ( 12S  \V 

oc  I  Time  (sec)  I  1  I 


1  5355 
3.0785 
5  6275 
3  3035 
3.8353 
2.1064 
4.5330 
7.8575 
1  1  3755 
7.2560 
8.5495 
4.9200 
8.7008 
14.9796 
20.3846 
13.4023 
15  3583 
8.9406 


,evel  4  Implementation  (32  \V 


VVeap 

Targ 

Processors 

32 

32 

2 

32 

32 

4 

32 

32 

8 

32 

32 

16 

32 

32 

32 

32 

2 

32 

160 

4 

32 

160 

8 

32 

160 

16 

32 

160 

32 

32 

Hi 

2 

32 

Wm 

4 

32 

320 

8 

32 

320 

16 

Time  (sec) 

1  64 1" 
1713 
1  934 
3.918 
12.798 
1.513 
1.285 
2.176 
7  354 
21.327 
2.340 
1.461 
2.091 
7.654 


Table  C-ll.  Timing  and  Speedups  of  the  Level  4  Implementation  (64  Wpns) 


Targ  Processors  Time  (sec) 


Table  C-15.  Assignment  Results  of  the  Level  1  Implementation  (128  VVpns) 


Weap 

Targ 

Processors 

Cost 

%  Wasted 

128 

128 

1 

2158.4 

100.0 

128 

128 

2 

1668.8 

74.4 

128 

128 

4 

1601.6 

68.1 

128 

128 

8 

1566.4 

65.2 

34.8 

128 

128 

16 

1558.4 

64.1 

35.9 

128 

128 

32 

1552.0 

63.8 

36.2 

128 

1 

mtmm 

128 

640 

2 

990.4 

9.4 

128 

640 

4 

15.0 

128 

640 

8 

82.7 

17.3 

128 

640 

16 

81.7 

18.3 

128 

640 

32 

81.1 

18.9 

1 

937.6 

100.0 

0.0 

2 

89.7 

10.3 

128 

4 

936.0 

85.3 

14.7 

128 

1280 

8 

936.0 

82.2 

17.8 

128 

1280 

16 

936.0 

80.9 

19.1 

128 

32 

936.0 

80.5 

19.4 

Table  C-16.  Assignment  Results  of  the  Level  2  Implementation  (32  Wpns) 


Weap 


32 

32 

32 

32 

32 

32 


32 

32 

32 

32 

32 

32 


32 

32 

32 

32 

32 

32 


Targ 


32 

32 

32 

32 

32 

32 


160 

160 

160 

160 

160 

160 


320 

320 

320 

320 

320 

320 


Cntrl 


Proc/CntiT 


Tot  Proc 


6 

10 

18 

12 

20 

24 


6 

10 

18 

12 

20 

24 


6 

10 

18 

12 

20 


Cost 


2123.6 

1008.0 

915.2 

1406.4 

1296.0 

1412.8 


368.0 

361.6 

355.2 

369.6 

363.2 
369.6 


272.0 

272.0 

272.0 

276.8 


%  Effective 


68.1 

65.6 

64.4 

65.6 

64.4 

64.4 


92.5 
91.2 

90.6 
91.2 
90.6 
90.6 


93.8 

93.8 

93.8 

93.8 

93.8 

93.8 


%  Idle 


16.3 

20.0 

21.9 

5.0 

6.9 

2.5 


1.9 

3.8 
5.0 

1.9 
3.1 
1.3 


1.9 

1.9 

1.9 

0.0 

0.0 

0.0 


7  Wasted 


15.6 

14.4 

13.7 

29.4 

28.7 
33.1 


0.6 

5.0 

4.1 
6.9 
6.3 

8.1 


4.4 
4  4 
4.4 
6.3 
6.3 
6.3 


8 


276.8 

276.8 


Table  C-17.  Assignment  Results  of  the  Level  2 


Weap 

6?" 

64 

64 

64 

64 

64 

64 

64 

64 

64 

64 

64 

64 

64 

64 

64 

64 

64 


Cntrl  Proc/Cntrl  Tot  Proc  Cost 

^  Y  T  Y  1158.4 

2  4  10  1099.2 

2  8  18  1033.6 

4  2  12  1236.8 

4  4  20  1171.2 

8  2  24  1220.8 

"  2  6“  523.2 

2  4  10  520.0 

2  8  18  518.4 

4  2  12  539.2 

4  4  20  537.6 

8  2  24  544.0 

2  2  6  476.8 

2  4  10  468.8 

2  8  18  460.8 

4  2  12  478.4 

4  4  20  470.4 

8  2  24  478.4 


Implementation  (64  Wpns) 
Effective  j  %  Idle  I  %  Wasu 


•I  Wasted 
17  9 


Table  C-18 
Weap  Targ 
128  128~ 
128  128 
128  128 
128  128 
128  128 
128  128 
128  640 

128  640 

128  640 

128  640 

128  640 

128  640 

128  1280 
128  1280 
128  1280 
128  1280 
128  1280 
128  1280 


Assignment  Results  of  the  Level  2 

Cntrl  Proc/Cntrl  Tot  Proc  Cost 
2~|  Y\  6  1361.6 

2  4  10  1259.2 

2  8  18  1203.2 

4  2  12  1446.4 

4  4  20  1372.8 

8  2  24  1483.2 

2  2  6  928.0 

2  4  10  902.4 

2  8  18  889.6 

4  2  12  963.2 

4  4  20  948.8 

8  2  24  976.0 

2  2  6  881.6 

2  4  10  846.4 

2  8  18  833.6 

4  2  12  900.8 

4  4  20  884.8 

8  2  24  920.0 


Implementation  (128  Wpns) 
%  Effective  I  %  Idle  I  %  Wasted 


'-V'V-Y-'W 


A 


Table  C-19  Assignment  Results  of  the  Level  3  Implementation  (32  Wpns 


■WTCTOfiTOBl 


Tot  Proc  Cost 


64 

64 

iStil 

64 

64 

64 

64 

64 

64 

64 

64 

mm 

64 

320 

64 

320 

64 

320 

64 

320 

I 

1 


Effective 


Table  C-20.  Assignment  Results  of  the  Level  3  Implementation  (64  Wpns) 


Weap  |  Targ  Cntrl  |  Proc/Cntrl  Tot  Proc  |  Cost 


6 

10 

18 

12 

20 

24 


6 

10 

18 

12 

20 

24 


6 

10 

18 

12 

20 

24 


2881.6 

1960.0 


9766.4   76.9  23.1
10984.0  75.0  25.0
10337.6  73.8  26.2
4660.8   70.3  29.7
7102.4   69.1  30.9
4192.0   68.4  31.6

4758.0   93.4  6.6
5696.0   92.2  7.8
5928.0   92.2  7.8
1764.0   89.8  10.2
1972.0   89.8  10.2
1222.0   89.1  10.9

1780.8   92.5  7.5
         91.9  8.1

ABSTRACT


The process of effectively coordinating and controlling resources
during a military engagement is known as Battle Management/Command,
Control, and Communications (BM/C3). One key task of BM/C3 is allocating
weapons to destroy targets. The focus of this research is on developing
parallel methods to achieve fast and cost-effective assignment of weapons
to targets. Using the sequential Hungarian method for solving the
assignment problem as a basis, this report presents the development of
four parallel assignment algorithms implemented on the Intel iPSC hypercube
computer.

The first approach partitions the problem space into smaller,
independent sub-problems and assigns each to a processing node in the
hypercube. The second and third approaches also partition the problem
space, but they assign each partition to a group of processing nodes.
Each group is controlled by a separate node, which further subdivides
the partition among members of the group. In the second approach, the
control node acts as an arbitrator to eliminate the redundant assignment
of weapons by selecting the least costly weapon allocation and idling
the more costly redundant allocations. The third approach also eliminates
redundant weapon allocations by selecting the least costly weapon
allocations, but directs additional processing to reallocate the more
costly weapons. The fourth approach is a parallel implementation of the
Hungarian method, where certain subtasks of the algorithm are performed
in parallel. This approach produces an optimal assignment instead of the
sub-optimal assignment generally obtained using any of the three
heuristic methods.
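
The arbitration step of the second approach can be illustrated with a
minimal sketch. It is not the report's iPSC implementation. It assumes
that each group member proposes (weapon, target, cost) allocations, that
an allocation is "redundant" when more than one weapon is proposed for
the same target, and that ties are broken in favor of the proposal seen
first. The function name arbitrate and the sample data are hypothetical.

```python
def arbitrate(proposals):
    """Keep the least costly proposal per target; idle the other weapons.

    proposals: list of (weapon, target, cost) tuples gathered by the
    control node from its group members (hypothetical data layout).
    Returns (kept, idled): the surviving allocations and the idled weapons.
    """
    best = {}    # target -> cheapest (weapon, target, cost) seen so far
    idled = []   # weapons whose redundant allocation was rejected
    for weapon, target, cost in proposals:
        current = best.get(target)
        if current is None:
            best[target] = (weapon, target, cost)
        elif cost < current[2]:
            # The new proposal is cheaper: idle the previous holder.
            idled.append(current[0])
            best[target] = (weapon, target, cost)
        else:
            # The existing allocation is at least as cheap: idle this weapon.
            idled.append(weapon)
    return sorted(best.values()), sorted(idled)


# Three weapons are proposed for target 5; only the cheapest is kept.
proposals = [(0, 5, 12.0), (1, 5, 9.5), (2, 7, 4.0), (3, 5, 11.0)]
kept, idled = arbitrate(proposals)
print(kept)   # [(1, 5, 9.5), (2, 7, 4.0)]
print(idled)  # [0, 3]
```

Under the same assumptions, the third approach would differ only in what
happens to the idled list: instead of leaving those weapons unused, the
control node would direct further processing to reallocate them.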

The relative performance of the four approaches is compared by
varying the number of weapons and targets, the number of processors,
and the size of the problem partitions. The first and second approaches
produce significantly faster assignment solutions than those possible with
the baseline sequential methods. The third and fourth approaches yield
slower solutions, but are still faster than sequential methods of
assignment.




